PREFACE TO THE DOVER EDITION


We both are very pleased that Dover is reissuing the second edition of A 
First Course in Numerical Analysis. Although it has been 24 years since 
the publication in 1977 of the second edition (which itself was published 
12 years after the first edition), we think that this book still contains 
plenty of material to interest students of numerical analysis. It is true 
that revolutions in computing that have ensued since 1977 have been 
accompanied by large advances in the science and art of numerical 
analysis, but the fundamentals of the subject remain much as they were 
in 1977. Thus, the student who reads this book and, in particular, tack- 
les the problems at the end of each chapter will be well placed to study 
any of the advances that have occurred since 1977. 

A separate manual with hints and answers to most of the problems 
was published to accompany the second edition. That manual, essentially 
unchanged, has been bound into this book. It contains hints or the out- 
lines of answers to almost all of the problems. When a problem—or part 
of a problem—does not appear in the Hints and Answers section, this is 
because a hint or answer was deemed unnecessary or because the space a 
useful hint or answer would have occupied would have been greater than 
was feasible. 

Apart from the inclusion of the Hints and Answers section, this 
edition is unchanged from the 1977 second edition, except that a few 
typographical and other minor errors have been corrected. 

Enjoy! 

Anthony Ralston 
Philip Rabinowitz 


October 2000 


PREFACE TO THE SECOND EDITION 


The 12 years since the publication of the first edition have been a time of 
great progress in numerical analysis. This progress is epitomized by the 
development in recent years of the first complete subroutine packages for 
digital computers—so-called mathematical software—programs which can 
accept as input the parameters of a problem in a particular area of numerical 
analysis and which will generally produce as output the solution to the 
problem within the accuracy desired (or, rarely, a statement that the prob- 
lem is insoluble or not solvable to the desired accuracy) without the neces- 
sity for the user to choose the method of solution. This has been made 
possible by the development of methods—or classes of methods—which 
are responsive to all or nearly all the difficulties or sensitivities which can 
occur in numerical analytic problems. 

It is hardly surprising, therefore, that the first edition is now badly 
out of date in many respects. In this edition we have brought all chapters 
in the book up to date as of 1977 and have deleted material no longer of 
interest because it has been superseded by more modern techniques. Some 
additional material has been deleted solely because of space limitations. 

The scope of the book itself may be easily gleaned from the Table of 
Contents. Here we note only the major additions to and changes from the 
first edition. 


Chapter 1. New sections on Norms, Error Analysis, and Condition 
and Stability; rewritten section on Computer Arithmetic. 

Chapter 2. New sections on Numerical Algorithms, Functionals, and 
the Method of Undetermined Coefficients. 

Chapter 3. New section on Splines. 

Chapter 4. New sections on Adaptive Integration and the Euler Trans- 
formation; rewritten sections on Numerical Differentiation. 


Chapter 5. New sections on Variable Order, Variable Step Methods,
Extrapolation Methods, and Stiff Equations; revised section on Runge- 
Kutta Methods. 

Chapter 6. New section on the Fast Fourier Transform. 

Chapter 7. New section on the Differential Correction Algorithm. 

Chapter 8. New sections on the Jenkins-Traub Method and a Newton- 
based Method for the zeros of polynomials; rewritten sections on Systems 
of Nonlinear Equations and on the general problem of the Zeros of Poly- 
nomials. 

Chapter 9. New sections on Overdetermined Systems and the Simplex 
Method; rewritten sections on Direct Methods and Error Analysis. 

Chapter 10. New sections on the Inverse Power Method and Jacobi-type
Methods for nonsymmetric matrices; rewritten section on the QR
Algorithm.


In addition, there are many changes in other sections to improve clarity 
and to reflect advances since the publication of the first edition. 

It will ordinarily not be possible to cover all the material in this book 
in a full-year course in numerical analysis. Rather than suggest topics for 
inclusion or exclusion, we would leave this to the instructor’s own taste 
and experience. The fact that the subjects of each of Chapters 3 through 10 
have themselves been the subjects of at least one book apiece should serve 
to emphasize to the student that a course taught from this book is indeed 
a first course in numerical analysis. Moreover, we have not covered such 
topics as the numerical solution of partial differential equations, integral 
equations, or boundary-value problems. These topics properly fall in the 
domain of advanced numerical analysis. Since the basis of much of advanced 
numerical analysis is the solution of systems of linear equations and the 
calculation of eigenvalues, these topics have been purposely placed at the 
end of this volume. 

In each of Chapters 3 through 10 there are a number of illustrative 
examples whose purpose is to enhance the student’s understanding of the 
relevant numerical method. Since a morass of numbers is more likely to 
impede this aim than otherwise, the numbers in these examples have, 
where possible, been kept simple. 

In this edition we have added problems at the end of each chapter 
corresponding to the new and rewritten sections. The problems fall generally 
into four categories: 


1. Simple proofs of topics considered in the text. 

2. Algebraic manipulations and derivations, which would not add materially 
to the understanding of the student if included in the text, but which 
may nevertheless be instructive. 

3. Computational problems. 


4. Proofs and derivations of results which are an extension of the subject 
matter in the text. 


Although the especially difficult problems have been starred (*), the 
student will find few really easy problems. One of the major purposes of 
the separately published Hints and Answers is to help the student solve those 
problems found particularly difficult. The student should be prepared to 
find minor discrepancies between calculated numerical answers and those 
in this manual. These will generally be the result of the idiosyncrasies of 
roundoff error on the computer on which the calculations have been per- 
formed. 

The few bibliographic references in the text itself are to topics outside 
the scope of numerical analysis or to topics not suitable for problems. 
The Bibliographic Notes and Bibliographies at the end of each chapter 
have been brought up to date for this edition. They are meant to guide the 
student to basic sources from which a deeper understanding of the subject 
matter of this book may be obtained. For this reason no attempt has been 
made to make the bibliography exhaustive, and comparatively few foreign- 
language references have been included. 


Anthony Ralston 
Philip Rabinowitz 


NOTATION 


Below is a list of symbols and notation used in this book. Amplified explanations
are given when necessary at the first use of the symbol or notation.
The reader is cautioned that some symbols may have more than one meaning,
but it is hoped that they are unambiguous in context; for example, $P_n(x)$ is
used as the notation for the Legendre polynomial of degree n in Chap. 4 and
thereafter, but in Chap. 2 this notation is also used for a general polynomial
of degree n.


A. Problems and references

Meaning                                               Symbol or example

1. References in text to problems at end of each
   chapter: numbers in braces                         {2}
2. References to bibliography at end of each
   chapter: name followed by date in parentheses      Feller (1950)
3. Problems of more than ordinary difficulty:
   asterisk next to problem number                    *10


B. General mathematical notation

Meaning                                               Page first defined or used

1. Approximately equal                                36
2. Binomial coefficient                               59

                                                 Symbol or               Page first
Meaning                                          example                 defined or used

3. Closed interval                               [a, b]                  1
4. Conjugate transpose of vector or matrix:
   superscript asterisk                          v*                      445
5. Continued fraction                            C,| C,|                 292
                                                 |x+D, |x+D,
6. Derivatives:
   (a) Single, double, or triple prime           $f''(\xi_1)$            1
   (b) Lowercase roman superscript               $f^{iv}(\xi_2)$         1
   (c) Letter or number in parentheses           $f^{(n)}(x)$            43
7. Determinant of a matrix                       |A|                     412
                                                 det(L,- ,)              423
8. Evaluation of a quantity at a point           ln                      210
9. Factorial function                            $x^{(n)}$               141
10. Functional                                   F(f)                    42
11. Inner product
   (a) T for transpose                           $v^T v$                 6
   (b) * for conjugate transpose                 x*x                     487
12. Integer functions
   (a) Ceiling                                   $\lceil x \rceil$       15
   (b) Floor                                     $\lfloor x \rfloor$     15
13. Norm
   (a) function: $L_p$                           $\|f\|_p$               7
   (b) matrix: general                           $\|A\|$                 7
   (c) matrix: Euclidean                         $\|A\|_E$               8
   (d) vector: $L_p$                             $\|v\|_p$               7
   (e) vector: Euclidean                         $\|v\|$                 6
14. Open interval                                (a;, m,)                40
15. Order of magnitude                           O(h?)                   174
16. Sequences of functions or numbers:
    indexed quantity in braces (cf. A1 above)    {x"}                    33
17. Spectral radius of a matrix                  $\rho(A)$               8
18. Transform pair                               Gig,                    264
19. Vector: boldface English or Greek letters
   (a) Column                                    v
   (b) Row (T for transpose)                     $v^T$                   6
20. Vector function                              $f = [f_1\ f_2 \cdots f_n]^T$   359
C. Specific mathematical symbols 
Page first 
defined 
Symbol Meaning or used 
1. B,(x) Bernstein polynomial 3 
2. B,(x) Bernoulli polynomial 136 


3. B, Bernoulli number 136 


Meaning 


Efficiency index 

Error function 

Hermite polynomial 

Identity matrix 

Cauchy index 

Jacobi polynomial 

Lagrangian interpolation polynomial 
Laguerre polynomial 

Legendre polynomial 

Chebyshev polynomial of second kind 
Transpose of matrix or column vector 
Chebyshev polynomial 

Shifted Chebyshev polynomial 

Trace of a matrix 

Computed solution of linear system 
True solution of linear system 
Backward-difference operator 
Gradient operator 
Forward-difference operator 
Central-difference operator 
Kronecker delta 

Gamma function 

Mean central-difference operator 

Set union 


Page first 
defined 
or used 
CHAPTER


ONE 
INTRODUCTION AND PRELIMINARIES 


1.1 WHAT IS NUMERICAL ANALYSIS? 


That numerical analysis is both a science and an art is a cliché to specialists 
in the field but is often misunderstood by nonspecialists. Is calling it an art 
and a science only a euphemism to hide the fact that numerical analysis is not 
a sufficiently precise discipline to merit being called a science? Is it true that 
“numerical analysis” is something of a misnomer because the classical 
meaning of analysis in mathematics is not applicable to numerical work? In 
fact, the answer to both these questions is no. The juxtaposition of science 
and art is due instead to an uncertainty principle which often occurs in 
solving problems, namely that to determine the best way to solve a problem 
may require the solution of the problem itself. In other cases, the best way to 
solve a problem may depend upon a knowledge of the properties of the 
functions involved which is unobtainable either theoretically or practically. 
A simple example will illustrate this. Two common methods for estimating 


$$\int_a^b f(x)\,dx$$


are the trapezoidal rule and the parabolic rule. The error incurred, i.e., the
difference between the true value of the integral and the approximation, in
the former is $-(b-a)^3 f''(\xi_1)/12n^2$, where n + 1 is the number of points at
which we evaluate f(x) in [a, b] and $\xi_1$ is some (unknown) point in [a, b].
For the parabolic rule the error is $-(b-a)^5 f^{iv}(\xi_2)/180n^4$, where again n + 1
is the number of points at which f(x) is evaluated and $\xi_2$ is an unknown
point in [a, b]. Which do we use, especially when f(x) is such that its derivatives
are not reasonably calculable? As numerical analysis is a science, it has
provided us with these two methods and the errors incurred when we use 
them, but as it is an art, it requires us to use our intuition, experience, and 
knowledge of functions “like” f(x) to choose that method best suited to our 
particular problem. 
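
As a rough illustration of the comparison just described, the following Python sketch applies both rules to a function whose integral is known, so that the actual errors can be set beside the two error formulas. The function names `trapezoidal` and `parabolic` and the test integrand exp(x) on [0, 1] are assumptions made for this illustration only; they do not come from the text.

```python
import math

def trapezoidal(f, a, b, n):
    """Composite trapezoidal rule using n + 1 equally spaced points."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return h * total

def parabolic(f, a, b, n):
    """Composite parabolic (Simpson) rule using n + 1 points; n must be even."""
    if n % 2 != 0:
        raise ValueError("n must be even for the parabolic rule")
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        # Interior points alternate between weights 4 and 2.
        total += (4 if i % 2 == 1 else 2) * f(a + i * h)
    return h * total / 3

if __name__ == "__main__":
    exact = math.e - 1.0  # integral of exp(x) over [0, 1]
    for n in (4, 8, 16):
        err_trap = exact - trapezoidal(math.exp, 0.0, 1.0, n)
        err_para = exact - parabolic(math.exp, 0.0, 1.0, n)
        print(n, err_trap, err_para)
```

For this smooth integrand the parabolic rule is far more accurate at the same n, but that conclusion relies on knowing that the fourth derivative is well behaved, which is exactly the kind of knowledge the text says may be unavailable.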

As a science, then, numerical analysis is concerned with the processes by 
which mathematical problems can be solved by the operations of arithmetic. 
Sometimes this will involve the development of algorithms to solve a prob- 
lem already in a form in which the solution can be found by arithmetic 
means, e.g., simultaneous linear equations. Often it will involve replacing 
quantities which cannot be calculated arithmetically, e.g., derivatives or inte- 
grals, by approximations which permit an approximate solution to be found. 
In this case, we shall naturally be interested in the errors incurred in our 
approximation. But in any case, the tools we shall use in developing the 
processes of numerical analysis will be the tools of exact mathematical 
analysis as classically understood. 

As an art, numerical analysis is concerned with choosing that procedure 
(and suitably applying it) which is “best” suited to the solution of a particular
problem. This implies the need for anyone who wishes to practice numerical
analysis to develop experience and with it, it is hoped, intuition. We
have therefore provided numerous examples to illustrate the numerical 
methods discussed. The purpose of these examples is to help the reader 
understand principles and develop an insight into computational processes. 
To further these aims, the numbers, where permissible, have been kept 
simple. 

Numerical analysis is a very different discipline today from what it was 
25 years ago at the advent of the high-speed digital computer. High-speed 
computation has revolutionized numerical analysis as an art and given enor- 
mous impetus to its development as a science. Our orientation in this book 
will be entirely toward methods which are particularly useful on digital 
computers. This does not mean ignoring either traditional desk calculators 
or modern hand-held electronic calculators, including those which are 
programmable. Most of the methods to be discussed are applicable on hand- 
held and desk calculators as well as on digital computers, but we recognize 
that despite the rapidly increasing popularity of hand-held calculators, the 
overwhelming majority of significant numerical computations are per- 
formed on digital computers. 


1.2 SOURCES OF ERROR 


Numerical answers to problems generally contain errors which arise in two 
areas: those inherent in the mathematical formulation of the problem and 
those incurred in finding the solution numerically. The former category 
includes the error incurred when the mathematical statement of a problem is
only an approximation to the physical situation. Such errors are often negli- 
gible, as in the case of neglecting relativistic effects in problems in classical 
mechanics. If they are not negligible, then no matter how accurate the 
numerical computations, there will be a significant error in the result. 
Another source of inherent error is the inaccuracies in the physical data. Such 
errors are also generally negligible when they are caused by inaccuracies in 
physical constants, e.g., the gravitational constant. But when they are the 
result of errors in empirical data, the worth of a computed solution must be 
carefully weighed against these errors. Moreover, because such errors are 
usually random, treating them analytically may be quite difficult. Such 
errors will play significant roles in Chaps. 6 and 9. 

There are three main sources of computational error. The first, familiar 
to all users of desk calculators or pencil and paper and to all programmers, 
is the gross error, or blunder. Digital computers have enormously reduced 
the probability that calculational blunders will occur, but of course the 
possibility of a programming blunder which results in the correct calculation 
of the wrong result is always present. In addition, undetected bugs in the 
compiler or other system software are not very unusual. When the cor- 
rectness or reasonableness of a computed solution cannot readily be verified, 
the possibility of a blunder should not be ignored. It is the two other sources 
of computational error which will chiefly interest us here, however. 

The first of these sources is that caused by solving not the problem as 
formulated but rather some approximation to it. This is usually caused by 
the replacement of an infinite, i.e., summation or integration, or
infinitesimal, i.e., differentiation, process by a finite approximation. Some 
examples of this are: 


1. Calculation of an elementary function, e.g., sin x, by using the first n terms 
of its infinite Taylor-series expansion 

2. Approximation of the integral of a function by a finite summation of 
functional values, as in the trapezoidal rule 

3. Solution of a differential equation by replacing the derivatives by approx- 
imations to them, e.g., difference quotients 

4. Solution of the equation f(x) =0 by the Newton-Raphson method, an 
iterative process which in general converges only in the limit as the 
number of iterations goes to infinity 


We shall denote this type of error in all its various forms as truncation 
error, since it often is the result of truncating an infinite process to get a finite 
process. In all the numerical procedures considered in this book, we shall be 
interested in estimating, or at least bounding, this error (to know it would, of 
course, be to eliminate it!). 
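
Example 1 above can be made concrete with a short Python sketch. It sums the first n nonzero terms of the Taylor series of sin x and compares the truncation error with the magnitude of the first omitted term, which bounds the error of an alternating series whose terms decrease in magnitude. The names `sin_taylor` and `first_omitted_term` are hypothetical, chosen only for this illustration.

```python
import math

def sin_taylor(x, n):
    """Sum of the first n nonzero terms of the Taylor series of sin x about 0."""
    term = x          # first term: x^1 / 1!
    total = 0.0
    for k in range(n):
        total += term
        # Next term: multiply by -x^2 / ((2k + 2)(2k + 3)).
        term *= -x * x / ((2 * k + 2) * (2 * k + 3))
    return total

def first_omitted_term(x, n):
    """Magnitude of the first omitted term, |x|^(2n+1) / (2n+1)!."""
    return abs(x) ** (2 * n + 1) / math.factorial(2 * n + 1)

if __name__ == "__main__":
    x = 1.0
    for n in range(1, 6):
        truncation_error = math.sin(x) - sin_taylor(x, n)
        print(n, truncation_error, first_omitted_term(x, n))
```

Here `math.sin` stands in for the true value; for larger n the roundoff error discussed next becomes comparable to the truncation error, so the comparison is meaningful only for small n.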

The other source of error of importance to us is that caused by the fact 
that arithmetic calculations can almost never be carried out with complete 
accuracy. Most numbers have infinite decimal representations which must
be rounded. But even if the data of a problem can be expressed exactly by 
finite decimal representations, division may introduce numbers which must 
be rounded and multiplication may produce more digits than can rea- 
sonably be retained. The error we introduce by rounding a number is called 
roundoff error. Like the errors in empirical data, roundoff error has a 
random character which makes it difficult to deal with. We shall consider 
these difficulties in more detail in Sec. 1.4. 


1.3 ERROR DEFINITIONS AND RELATED MATTERS


In the previous section, we relied upon the fact that the reader undoubtedly 
has a good intuitive notion of error. Here we shall formalize the concept of 
error. The two basic definitions are 


Definition 1.1
True value = approximate value + error


Definition 1.2

$$\text{Relative error} = \frac{\text{error}}{\text{true value}}$$


Thus, denoting error by E and relative error by RE, if we approximate 1/3 by
.333, we have

$$E = \tfrac{1}{3} \times 10^{-3} \qquad \text{and} \qquad RE = 10^{-3}$$


Generally, we shall be interested in E (which is sometimes called absolute 
error) rather than RE, but when the true value of a quantity is very small or 
very large, relative errors are more meaningful. For example, if the true value 
of a quantity is $10^{10}$, an error of $10^{5}$ is probably not serious, but this is more
meaningfully expressed by saying that $RE = 10^{-5}$. In actual computation of
the relative error, we shall often replace the unknown true value by the 
computed approximate value. 
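
The two definitions can be checked with exact rational arithmetic. The sketch below, which assumes only Python's standard `fractions` module, evaluates E and RE for the approximation of 1/3 by .333; the function names `error` and `relative_error` are illustrative and not taken from the text.

```python
from fractions import Fraction

def error(true_value, approx):
    """Definition 1.1: true value = approximate value + error."""
    return true_value - approx

def relative_error(true_value, approx):
    """Definition 1.2: error divided by true value."""
    return error(true_value, approx) / true_value

if __name__ == "__main__":
    true_value = Fraction(1, 3)
    approx = Fraction(333, 1000)
    print(error(true_value, approx))           # E as an exact fraction
    print(relative_error(true_value, approx))  # RE as an exact fraction
```

Using `Fraction` keeps roundoff out of the calculation, so the printed values are exactly those given by the definitions.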


1.3-1 Significant Digits 


Let x be a real number which in general has an infinite decimal representation.
We shall say that x has been correctly rounded to a d-decimal-place
number, which we denote by $x^{(d)}$, if the roundoff error $\epsilon$ is such that†


$$|\epsilon| = |x - x^{(d)}| \le \tfrac{1}{2} \times 10^{-d} \tag{1.3-1}$$


† When $|\epsilon| = \tfrac{1}{2} \times 10^{-d}$, we have the choice of rounding “up” or “down.” That is, the
number .2775500 ... rounded to four decimal places may be either .2776 or .2775. Commonly,
such numbers are always rounded “up” or always “down,” but, since it is usually desirable to
avoid any bias in roundoff, a rule such as rounding so that the last digit is always even (or
always odd) is desirable in lengthy computations.
Thus, if x = 6.74399666 ..., $x^{(3)} = 6.744$ and $x^{(7)} = 6.7439967$.
If y is any approximation to a true value x, then the kth decimal place of
y is said to be significant if


$$|x - y| \le \tfrac{1}{2} \times 10^{-k} \tag{1.3-2}$$


Therefore, every digit of a correctly rounded number is significant.‡
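
A minimal sketch of correct rounding and of the significance test follows, assuming Python's standard `decimal` module so that decimal digits are handled exactly. The helper names `round_to_places` and `is_significant` are hypothetical, and the tie rule used (last digit even) is one of the unbiased choices mentioned in the footnote on rounding, not a rule prescribed by the text.

```python
from decimal import Decimal, ROUND_HALF_EVEN

def round_to_places(x, d):
    """Round the decimal string x to d places, resolving ties to an even last digit."""
    quantum = Decimal(10) ** (-d)          # 10^(-d), e.g. 0.001 for d = 3
    return Decimal(x).quantize(quantum, rounding=ROUND_HALF_EVEN)

def is_significant(x, y, k):
    """True if the kth decimal place of y is significant, i.e. |x - y| <= (1/2) 10^(-k)."""
    bound = Decimal("0.5") * Decimal(10) ** (-k)
    return abs(Decimal(x) - Decimal(y)) <= bound

if __name__ == "__main__":
    print(round_to_places("6.74399666", 3))
    print(round_to_places("6.74399666", 7))
    print(is_significant("3.76512", "3.7648", 3))
```

Passing the numbers as strings avoids introducing binary floating-point roundoff before the decimal rounding is applied.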

It might seem natural now to define the number of significant digits in y
to be all the digits of y satisfying the definition of significant digit. Thus, if
$y_1 = .9863$ is an approximation to $x_1 = .98632$ and if $y_2 = .0028$ is an
approximation to $x_2 = .00278$, we would say $y_1$ and $y_2$ both have four
significant digits. But are $y_1$ and $y_2$ equally “significant”? If they are the
final answers to a computation and we are interested in absolute error, then
they certainly are. But if they are intermediate numbers in a calculation
which are to be used later as divisors, then they most certainly are not: for
the magnitude of the error in $1/y_1$ is much less than that in $1/y_2$ {2}. We shall
therefore avoid the use of the notion of the number of significant digits in a
number.

The example above indicates that numbers with leading zeros can
cause substantial magnification of error when used as divisors. In numerical
calculations subtraction (or addition of numbers of opposite sign) is the
most common source of numbers with leading zeros. Suppose all the digits
in $y_1 = 2.78493$ and $y_2 = 2.78469$ are significant. Then the error in
$y_1 - y_2 = .00024$ is in magnitude at most $10^{-5}$ (the sum of the maximum
magnitudes of the errors in $y_1$ and $y_2$). But, if $y_1 - y_2$ is later used as a
divisor, the magnitude of the error will be greatly increased. Note that the
relative error in $y_1$ is less than $\tfrac{1}{2} \times 10^{-5}/2.784925 \approx 1.8 \times 10^{-6}$, while that
in $y_1 - y_2$ may be as much as $10^{-5}/.00023 \approx 4.3 \times 10^{-2}$. Thus we can say
that if the sum or difference of two numbers causes a large increase in
relative error, then if this result is later used as a divisor, substantial
magnification of error may occur. The phenomenon, often called subtractive
cancellation, is one of which the numerical analyst must always be aware.
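
The effect of using such a difference as a divisor can be seen in a small exact calculation. The sketch below assumes the two numbers 2.78493 and 2.78469 of the example, each in error by at most half a unit in the fifth decimal place, and computes the range in which the true quotient 1/(x1 - x2) may lie. The function name `divisor_bounds` is hypothetical.

```python
from fractions import Fraction

def divisor_bounds(y1, y2, eps):
    """Range of 1/(x1 - x2) when |x1 - y1| <= eps and |x2 - y2| <= eps.

    Assumes y1 - y2 > 2 * eps, so the true difference cannot change sign.
    """
    diff = y1 - y2
    if diff <= 2 * eps:
        raise ValueError("difference is not safely positive")
    return 1 / (diff + 2 * eps), 1 / (diff - 2 * eps)

y1 = Fraction("2.78493")
y2 = Fraction("2.78469")
eps = Fraction(1, 2) / 10**5      # every digit shown is significant

low, high = divisor_bounds(y1, y2, eps)
nominal = 1 / (y1 - y2)
print(float(low), float(nominal), float(high))
```

Although each input is known to about two parts in a million, the computed quotient is uncertain by several percent, which is the magnification the paragraph above warns about.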


1.3-2 Error in Functional Evaluation


An important general problem is that of estimating the error in the evaluation
of $f(y_1, \ldots, y_n)$ where


$$y_i = x_i - \epsilon_i \tag{1.3-3}$$


‡ It is not satisfactory to define the kth digit of y as significant only if x and y are identical
when rounded to k digits. For suppose x = 3.76512 and y = 3.7648. Then the 6 would not be 
significant, but the 4 would be. By our definition, both are significant. 
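with $x_i$ the true value of the ith variable and $\epsilon_i$ the error. Then


$$f(x_1, \ldots, x_n) - f(y_1, \ldots, y_n) = \sum_{i=1}^{n} \epsilon_i \frac{\partial f}{\partial y_i} + \frac{1}{2}\left(\sum_{i=1}^{n} \epsilon_i \frac{\partial}{\partial y_i}\right)^{2} f + \cdots \tag{1.3-4}$$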


where the partial derivatives are to be evaluated at $(y_1, \ldots, y_n)$. Since we shall
usually be able to assume that terms containing products of the errors are
small compared with the first term on the right-hand side of (1.3-4), we can
write


$$f(x_1, \ldots, x_n) - f(y_1, \ldots, y_n) \approx \sum_{i=1}^{n} \epsilon_i \frac{\partial f}{\partial y_i} \tag{1.3-5}$$


Equation (1.3-5) can serve as a convenient tool in estimating errors in functional
evaluation, although in special cases more terms in (1.3-4) may need
to be retained. In Secs. 1.5 and 1.6 error analyses for some special cases of
$f(y_1, \ldots, y_n)$ are considered.
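
As a small worked instance of the first-order estimate, consider the product f(y1, y2) = y1 y2, whose partial derivatives are y2 and y1. The Python sketch below (with the hypothetical name `first_order_error`, and illustrative values chosen here rather than taken from the text) compares the estimate with the exact change in f; the discrepancy is the neglected product of the two errors.

```python
from fractions import Fraction

def first_order_error(partials, errors):
    """First-order estimate: sum of eps_i times df/dy_i evaluated at (y_1, ..., y_n)."""
    return sum(e * p for e, p in zip(errors, partials))

def product(a, b):
    return a * b

# Illustrative values only: approximations y_i and errors eps_i, with x_i = y_i + eps_i.
y = (Fraction(3), Fraction(5))
eps = (Fraction(1, 100), Fraction(-2, 100))
x = (y[0] + eps[0], y[1] + eps[1])

exact_change = product(*x) - product(*y)
estimate = first_order_error(partials=(y[1], y[0]), errors=eps)
print(exact_change, estimate, exact_change - estimate)
```

Because the arithmetic is exact, the difference between the two printed changes is precisely the second-order term eps_1 eps_2.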


1.3-3 Norms 


In dealing with vectors, matrices, and functions, the problem arises of meas- 
uring their size in some convenient form. This is usually done by means of a 
norm, a real-valued function with properties which generalize the usual 
Euclidean concept of length. 

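If we consider a vector $\mathbf{v} = (v_1, \ldots, v_n)^T$,† then the Euclidean length of $\mathbf{v}$,
which we shall for the moment denote by $\|\mathbf{v}\|$ (read the norm of $\mathbf{v}$), is given
by


$$\|\mathbf{v}\| = (\mathbf{v}^T\mathbf{v})^{1/2} = \left(\sum_{i=1}^{n} v_i^2\right)^{1/2} \tag{1.3-6}$$

This norm has the following properties:


Property 1 $\|\mathbf{v}\| \ge 0$ and $\|\mathbf{v}\| = 0$ if and only if $\mathbf{v} = \mathbf{0}$; that is, $v_i = 0$, $i = 1, \ldots, n$.


Property 2 $\|\alpha\mathbf{v}\| = |\alpha| \cdot \|\mathbf{v}\|$ for any real (or complex) number $\alpha$.


Property 3 $\|\mathbf{u} + \mathbf{v}\| \le \|\mathbf{u}\| + \|\mathbf{v}\|$, the triangle inequality.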


† The superscript T applied to any vector or matrix denotes the transpose. All vectors are
assumed to be column vectors so that a row vector is represented as the transpose of a (column) 
vector. 
Properties 1 and 2 are immediately obvious, while Property 3 is a consequence
of the Cauchy-Schwarz inequality {6}.

The norm defined by (1.3-6) is not the only function which has the above 
three properties. In fact, any function which has these properties is called a 
vector norm. Thus, a generalization of (1.3-6) is the $L_p$ norm


$$\|\mathbf{v}\|_p = \left(\sum_{i=1}^{n} |v_i|^p\right)^{1/p} \tag{1.3-7}$$


defined for any $p \ge 1$. If we let p tend to infinity, we get {6} the maximum, or
$L_\infty$, norm

$$\|\mathbf{v}\|_\infty = \max_{1 \le i \le n} |v_i| \tag{1.3-8}$$


A further generalization yields the weighted $L_p$, or $L_{p,w}$, norm


$$\|\mathbf{v}\|_{p,w} = \left(\sum_{i=1}^{n} w_i |v_i|^p\right)^{1/p} \tag{1.3-9}$$


where $\mathbf{w} = (w_1, \ldots, w_n)^T$ is a vector of positive weights $w_i$, $i = 1, \ldots, n$. The
proof that these are indeed norms is left to the reader {6}. The most
frequently used norms in practice are the $L_1$, $L_2$, and $L_\infty$ norms and their
weighted counterparts.
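
These vector norms translate directly into code. The following Python sketch is one straightforward way to compute them; the names `lp_norm` and `max_norm` are assumptions for illustration, and the weighted form is taken to be the sum of w_i |v_i|^p raised to the power 1/p.

```python
def lp_norm(v, p, w=None):
    """L_p norm of v; with positive weights w, the weighted L_p norm."""
    if p < 1:
        raise ValueError("p must be at least 1")
    if w is None:
        w = [1.0] * len(v)
    return sum(wi * abs(vi) ** p for wi, vi in zip(w, v)) ** (1.0 / p)

def max_norm(v):
    """Maximum (L-infinity) norm: the largest component in magnitude."""
    return max(abs(vi) for vi in v)

if __name__ == "__main__":
    v = [3.0, -4.0]
    for p in (1, 2, 10, 50):
        print(p, lp_norm(v, p))
    print("max", max_norm(v))
```

Printing the norm for increasing p shows the values decreasing toward the maximum norm, in line with the limit stated above.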

For functions f(x) continuous on a finite interval [a, b], any real
functional (see Sec. 2.3) which satisfies properties 1 to 3 is a norm. Thus,
corresponding to the $L_p$ vector norm, we have the $L_p$ norm for functions
given by


$$\|f\|_p = \left[\int_a^b |f(x)|^p\,dx\right]^{1/p} \tag{1.3-10}$$

and corresponding to the maximum norm, we have the $L_\infty$, or uniform, norm

$$\|f\|_\infty = \max_{a \le x \le b} |f(x)| \tag{1.3-11}$$


The counterpart to the weighted $L_p$ norm for functions is


$$\|f\|_{p,w} = \left[\int_a^b w(x)\,|f(x)|^p\,dx\right]^{1/p} \tag{1.3-12}$$


where w(x) is a nonnegative weight function which does not vanish on some
subinterval of [a, b].
Returning now to vector norms, we can use them to define norms on a 
matrix A as follows: 
$$\|A\| = \max_{\|\mathbf{x}\| = 1} \|A\mathbf{x}\| \tag{1.3-13}$$
Such a norm is called a subordinate or induced norm. That this norm has 
properties 1 to 3 is immediate {7}. In addition, for square matrices, we have 
from (1.3-13): 
Property 4

$$\|AB\| \le \|A\| \cdot \|B\| \tag{1.3-14}$$


In view of this property of subordinate norms, we shall define a matrix
norm as a norm which satisfies Property 4. For the $L_1$ and $L_\infty$ vector norms,
the corresponding subordinate matrix norms are given respectively by {7}


$$\|A\|_1 = \max_{1 \le j \le n} \sum_{i} |a_{ij}| \tag{1.3-15}$$

and

$$\|A\|_\infty = \max_{1 \le i \le n} \sum_{j} |a_{ij}| \tag{1.3-16}$$


so that $\|A\|_1 = \|A^T\|_\infty$. The computation of the matrix norm induced by
other vector norms, including the $L_2$, or Euclidean, norm (1.3-6), is quite
difficult.
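
The two easily computed subordinate norms are the largest column sum and the largest row sum of absolute values. A minimal Python sketch follows, with matrices stored as lists of rows; the function names are hypothetical and chosen only for this illustration.

```python
def matrix_norm_1(a):
    """Subordinate L_1 norm: largest column sum of |a_ij|."""
    return max(sum(abs(row[j]) for row in a) for j in range(len(a[0])))

def matrix_norm_inf(a):
    """Subordinate L_infinity norm: largest row sum of |a_ij|."""
    return max(sum(abs(x) for x in row) for row in a)

def transpose(a):
    """Transpose of a list-of-rows matrix."""
    return [list(col) for col in zip(*a)]

if __name__ == "__main__":
    a = [[1, -2], [3, 4]]
    print(matrix_norm_1(a), matrix_norm_inf(a))
    print(matrix_norm_1(a) == matrix_norm_inf(transpose(a)))
```

The final comparison checks, for one matrix, the relation between the L_1 norm of A and the L_infinity norm of its transpose noted above.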

From the definition (1.3-13), it follows {8} that for any vector norm and 
the corresponding subordinate matrix norm, 


$$\|A\mathbf{x}\| \le \|A\| \cdot \|\mathbf{x}\| \tag{1.3-17}$$


for all matrices A and vectors x. Any matrix norm for which (1.3-17) holds 
with respect to a given vector norm is said to be consistent or compatible 
with that vector norm. 

An example of a matrix norm which is not a subordinate norm is given 
by the Euclidean matrix norm (sometimes called the Frobenius norm) 


$$\|A\|_E = \left(\sum_{i} \sum_{j} a_{ij}^2\right)^{1/2} \tag{1.3-18}$$


That this is indeed a matrix norm is quite simple to prove {7}. And that it is
not a subordinate norm follows from the fact that for all subordinate norms,
$\|I\| = 1$, where I is the unit matrix, whereas $\|I\|_E = \sqrt{n}$. Thus the Euclidean
matrix norm is a generalization of the Euclidean vector norm but is not
subordinate to it. However, it is consistent with the Euclidean vector norm
{8}. In addition, we have that {8}


$$\|A\|_2 \le \|A\|_E \le n^{1/2}\,\|A\|_2 \tag{1.3-19}$$
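
The statements about the Euclidean matrix norm are easy to check numerically. The sketch below (function names are assumptions for illustration) computes the norm for a real matrix, evaluates it for the unit matrix, and tests consistency with the Euclidean vector norm for one sample vector.

```python
import math

def euclidean_matrix_norm(a):
    """Euclidean (Frobenius) matrix norm: square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in a for x in row))

def euclidean_vector_norm(v):
    """Euclidean vector norm: square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in v))

def matvec(a, x):
    """Matrix-vector product Ax for a list-of-rows matrix."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def identity(n):
    """The n x n unit matrix."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

if __name__ == "__main__":
    print(euclidean_matrix_norm(identity(4)))   # square root of n for the unit matrix
    a, x = [[1.0, -2.0], [3.0, 4.0]], [1.0, 1.0]
    print(euclidean_vector_norm(matvec(a, x)),
          euclidean_matrix_norm(a) * euclidean_vector_norm(x))
```

A single sample vector does not prove consistency, of course; it only illustrates the inequality that the cited problem asks the reader to establish.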


Closely related to the notion of matrix norm is that of the spectral radius 
p(A) of a matrix, defined by 


$$\rho(A) = \max_{1 \le i \le n} |\lambda_i| \tag{1.3-20}$$

where $\lambda_i$ are the eigenvalues of A. It is easy to show {9} that for any consistent
matrix norm


$$\rho(A) \le \|A\| \tag{1.3-21}$$
Further, we have, using the definition of matrix norm (1.3-13) and Theorem
10.6, that


$$\|A\|_2^2 = \rho(A^T A) \tag{1.3-22}$$

From this it follows that if A is a symmetric matrix, then

$$\|A\|_2 = \rho(A) \tag{1.3-23}$$


For this reason, the $L_2$ matrix norm is also called the spectral norm.

The use of norms is very important in measuring the quality of a
computed approximation $\mathbf{v}_c$ to a desired true result $\mathbf{v}_t$. For any norm,
$\|\mathbf{v}_c - \mathbf{v}_t\|$ is a measure of the deviation of the approximation. We shall use
this approach in Chap. 9, where some additional results on norms are given.
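
For 2 x 2 matrices the spectral radius and the spectral norm can be written in closed form, which gives a quick check of the relations above. The sketch below is restricted to real 2 x 2 matrices, uses the quadratic formula for the eigenvalues of a symmetric matrix, and takes the spectral norm to be the square root of the spectral radius of the product of the transpose of A with A. All function names are hypothetical.

```python
import math

def symmetric_2x2_eigenvalues(a, b, c):
    """Eigenvalues of the symmetric matrix [[a, b], [b, c]]."""
    mean = (a + c) / 2.0
    radius = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    return mean - radius, mean + radius

def spectral_radius_2x2(a, b, c):
    """Spectral radius of the symmetric matrix [[a, b], [b, c]]."""
    return max(abs(lam) for lam in symmetric_2x2_eigenvalues(a, b, c))

def spectral_norm_2x2(m):
    """L_2 norm of a real 2 x 2 matrix m, from the spectral radius of m^T m."""
    (p, q), (r, s) = m
    # Entries of the symmetric product m^T m.
    a = p * p + r * r
    b = p * q + r * s
    c = q * q + s * s
    return math.sqrt(spectral_radius_2x2(a, b, c))

if __name__ == "__main__":
    sym = [[2.0, 1.0], [1.0, 2.0]]
    print(spectral_radius_2x2(2.0, 1.0, 2.0), spectral_norm_2x2(sym))
    print(spectral_norm_2x2([[1.0, 2.0], [0.0, 1.0]]))
```

For the symmetric example the two printed quantities agree, while for the nonsymmetric example the spectral norm exceeds the spectral radius (which is 1, since both eigenvalues equal 1).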


1.4 ROUNDOFF ERROR 


If the solution of a problem requires many thousands or even millions of 
arithmetic operations, each of which is performed using rounded numbers, it 
is intuitively clear that the accumulated roundoff error may significantly 
affect the result. It is true that, given the computation to be performed and 
the roundoff rules to be applied, the roundoff error at each step is
determined. Thus roundoff is not a random process. But a priori it is essen- 
tially impossible to determine what the roundoff error will be in the millionth 
or even the hundredth step of the computation. Therefore, a probabilistic 
approach to roundoff error is helpful in understanding how this error 
accumulates. 


1.4-1 The Probabilistic Approach to Roundoff: A Particular Example 


Consider the addition of n numbers $\{a_i\}$ each correctly rounded to d decimal
places. Since the error in each number is no greater than $\tfrac{1}{2} \times 10^{-d}$, the
accumulated error in the sum is no greater than $(n/2) \times 10^{-d}$. But since most
errors will be less in magnitude than $\tfrac{1}{2} \times 10^{-d}$, and since the differing signs in
the errors will cause some cancellation of error, we expect the accumulated
error to be substantially less than $(n/2) \times 10^{-d}$ in general. Our object here is
to consider just what the distribution of this error is.

Denote by $\epsilon_i$ the roundoff error in $a_i$, and for convenience, let us, without
loss of generality, take d = 0 in what follows. Then $\epsilon_i$ takes on values in
the interval $[-\tfrac{1}{2}, \tfrac{1}{2}]$, and if we assume each value in this range is equally
likely, then the probability density of $\epsilon_i$ is shown in Fig. 1.1, which indicates
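that

$$\Pr(x_1 < \epsilon_i < x_2) = \begin{cases} x_2 - x_1 & -\tfrac{1}{2} \le x_1 \le x_2 \le \tfrac{1}{2} \\ x_2 + \tfrac{1}{2} & x_1 < -\tfrac{1}{2};\ -\tfrac{1}{2} \le x_2 \le \tfrac{1}{2} \\ \tfrac{1}{2} - x_1 & -\tfrac{1}{2} \le x_1 \le \tfrac{1}{2};\ x_2 > \tfrac{1}{2} \\ 1 & x_1 < -\tfrac{1}{2};\ x_2 > \tfrac{1}{2} \end{cases}$$

Figure 1.1 Graph of $p_i(x)$, the probability density of $\epsilon_i$.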

Now let us consider what happens when $a_i$ is added to $a_j$, assuming that the
roundoff errors in the two numbers are independent. Then {11} the probability
density of the error $\epsilon_{ij} = \epsilon_i + \epsilon_j$ in $a_i + a_j$ is shown in Fig. 1.2.

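From Fig. 1.1 we can calculate the variance $\sigma_i^2$ of $p_i(x)$ as


$$\sigma_i^2 = \int_{-1/2}^{1/2} x^2\,dx = \tfrac{1}{12} \tag{1.4-2}$$

and from Fig. 1.2, the variance of $p_{ij}(x)$ is†

$$\sigma_{ij}^2 = \int_{-1}^{0} x^2(1 + x)\,dx + \int_{0}^{1} x^2(1 - x)\,dx = \tfrac{1}{6} \tag{1.4-3}$$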
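
Both variances are integrals of polynomials and can be verified exactly. The Python sketch below uses a small helper, `integrate_poly` (a hypothetical name), which integrates a polynomial given by its coefficients in increasing powers of x; the densities assumed are the uniform density on [-1/2, 1/2] for a single error and the triangular density 1 - |x| on [-1, 1] for the sum of two errors.

```python
from fractions import Fraction

def integrate_poly(coeffs, lo, hi):
    """Exact integral from lo to hi of sum(coeffs[k] * x**k), with integer coeffs."""
    def antiderivative(x):
        return sum(Fraction(c, k + 1) * x ** (k + 1) for k, c in enumerate(coeffs))
    return antiderivative(Fraction(hi)) - antiderivative(Fraction(lo))

# Variance of one roundoff error: integral of x^2 over [-1/2, 1/2].
var_single = integrate_poly([0, 0, 1], Fraction(-1, 2), Fraction(1, 2))

# Variance of the sum of two errors: x^2 (1 + x) on [-1, 0] plus x^2 (1 - x) on [0, 1].
var_pair = integrate_poly([0, 0, 1, 1], -1, 0) + integrate_poly([0, 0, 1, -1], 0, 1)

print(var_single, var_pair)
```

The second variance is twice the first, as expected for the sum of two independent errors with zero mean.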


Corresponding results for the sum of three {11} and four numbers would
suggest a result which is only hinted at by the results for one and two
numbers, namely that in the limit as $n \to \infty$, the density function for the sum
of n numbers approaches the normal density function with zero mean and
variance $n\sigma_i^2$ with $\sigma_i^2$ given by (1.4-2). Thus, if $\epsilon$ is the accumulated roundoff
error for n additions,


$$\Pr(x < \epsilon < x + dx) \to \left(\frac{6}{\pi n}\right)^{1/2} e^{-6x^2/n}\,dx \tag{1.4-4}$$

as $n \to \infty$, where the function on the right-hand side of (1.4-4) is the normal
density function with zero mean and standard deviation $\sigma = (n/12)^{1/2}$. This
result can be derived rigorously using the central-limit theorem of probability
theory [see Feller (1950), pp. 191ff.], which states that the distribution of
a sum of n mutually independent random variables with a common distribution
approaches a normal distribution as $n \to \infty$ with mean equal to that of
the common distribution and variance equal to n times the variance of the


+ Or we can calculate o,, using the result that the variance of the sum of independent 
distributions with zero mean is the sum of the variances. 
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Figure 1.2 Graph of p;,(x), the probability 
density of €,;. 


common distribution. [When d # 0, the only changes that must be made are 
to multiply all variances by 107 74 and to multiply x and dx on the right- 
hand side of (1.4-4) by 107] 

In actual fact the normal density function gives a very good approximation
to the distribution of $\epsilon_1 + \cdots + \epsilon_n$ for quite small n; so let us suppose
that we can use the right-hand side of (1.4-4) as the probability density for
the error in our sum of n numbers. The probable error, which is defined as
that positive value of x for which the probability is $\frac12$ that the magnitude of $\epsilon$
will exceed x, is given by

$$0.6745\,\sigma = 0.6745\,(n/12)^{1/2} \approx 0.1947\,n^{1/2} \tag{1.4-5}$$


Thus, while the error cannot exceed n/2 in magnitude, the probable error is 
proportional to the square root of n. For the sum of n numbers, then, the 
probable error is substantially less than the maximum error, for lengthy 
computations, i.e., large n, by an order of magnitude or more. This is a
general result which we would intuitively expect to hold also for more 
complicated computations where the analysis of roundoff is too complex to 
carry out. 
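The gap between the probable error and the maximum error can be tabulated directly. This is a small sketch under the normal approximation of (1.4-4), with d = 0; the function names are illustrative. The constant 0.6745 is the 75th percentile of the standard normal distribution, since the magnitude exceeds x with probability 1/2 exactly when the error itself is below x with probability 3/4.

```python
import math
from statistics import NormalDist

def probable_error(n):
    """Probable error of a sum of n rounded numbers (normal approximation)."""
    sigma = math.sqrt(n / 12)               # standard deviation from (1.4-4)
    return NormalDist().inv_cdf(0.75) * sigma

def maximum_error(n):
    """Worst-case error: every term is off by 1/2 in the same direction."""
    return n / 2

for n in (12, 1200, 120_000):
    print(n, probable_error(n), maximum_error(n))
```

For n = 1200 the probable error is about 6.7 while the bound is 600, a ratio of nearly 90.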

This result leads to what seems like a paradoxical situation in error 
analysis. On the one hand, in numerical computations we are usually in- 
terested in bounding the error incurred. That is, when we finish a computa- 
tion, we like to be able to say that the computed result differs by no more 
than the error bound from the true result. In deriving such error bounds, we 
shall then, for the roundoff component of the error, choose its maximum 
value. On the other hand, we now realize that by so doing we are generally 
being unduly conservative. This paradox can be resolved only in the context 
of a particular situation. If roundoff error is small compared with truncation 
error, then using its maximum value will not make the error bound unrealistic.
But if roundoff is the dominant error, the unrealistic error bound may
have to be replaced by an estimate (generally a conservatively high
estimate) of the expected error in order to get a usable result. Because $\sqrt{n}$
appears in the probable error while n appears in the maximum error in the
above example, a common way to estimate the roundoff error in a long
computation is to find a bound on the error and then to replace n (where n is
a measure of the number of computations) by $\sqrt{n}$.

A problem closely related to that considered above is forming the
weighted sum of the n numbers

$$\sum_{i=1}^{n} \alpha_i a_i \tag{1.4-6}$$


If the roundoff errors in the $a_i$'s all have the density function shown in
Fig. 1.1, the result, analogous to the one above, is that the density function of
the error in the sum approaches a normal density function with zero mean
and variance

$$\sigma^2 = \frac{1}{12}\sum_{i=1}^{n} \alpha_i^2 \tag{1.4-7}$$


Linear combinations of the form (1.4-6) occur often in numerical analysis, as
we shall see. The coefficients $\alpha_i$ may often be chosen arbitrarily except for a
constraint of the form

$$\sum_{i=1}^{n} \alpha_i = \alpha > 0 \tag{1.4-8}$$

An obvious question then is: What values of the $\alpha_i$ lead to the best roundoff
behavior in the sum? The answer depends on what we mean by "best." If we
only wish to minimize the bound on the roundoff error, then all sets of $\alpha_i$
which are all positive are equally good. But if $\alpha_1 = \alpha$ and all the other $\alpha_i$ are
zero, then the worst case will often be realized. A more reasonable definition
of "best" would be that set of $\alpha_i$ which minimizes $\sigma^2$ in (1.4-7), since this will
lead to good roundoff behavior on the average (why?). Minimizing $\sigma^2$ subject
to the constraint (1.4-8) is a problem easily solved using the Lagrangian-multiplier
technique {15}. The answer is that the $\alpha_i$ are all equal to $\alpha/n$.
Therefore, the roundoff properties of any sum of the form (1.4-6) can be
judged by comparing the sum of the squares of the $\alpha_i$ with $\alpha^2/n$.
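The comparison suggested in the last sentence can be expressed as a single ratio. The sketch below is illustrative only (the name `variance_ratio` is not from the text): it divides the sum of the squares of the weights by $\alpha^2/n$, so a value of 1 means the weights are optimal in the sense of (1.4-7) and larger values mean proportionally larger error variance.

```python
def variance_ratio(alphas):
    """Sum of squared weights divided by its minimum alpha**2 / n.

    alphas: weights of the linear combination; their sum is the
    constraint value alpha of (1.4-8) and is assumed to be nonzero.
    """
    n = len(alphas)
    alpha = sum(alphas)
    return sum(a * a for a in alphas) / (alpha * alpha / n)

print(variance_ratio([0.25, 0.25, 0.25, 0.25]))  # equal weights
print(variance_ratio([0.4, 0.3, 0.2, 0.1]))      # mildly unequal weights
print(variance_ratio([1.0, 0.0, 0.0, 0.0]))      # all weight on one term
```

Equal weights give the ratio 1, while putting all the weight on one of four terms gives 4.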


1.5 COMPUTER ARITHMETIC 


The example in the previous section was idealized in the sense that it dealt 
with numbers without regard to the number of digits in each number. On 
digital computers, of course, there are restrictions on the number of digits in 
each number. Therefore in order to perform roundoff-error analyses of
numerical algorithms implemented on digital computers, it is first necessary
to understand how computers perform arithmetic. This is the subject of
this section. 


1.5-1 Fixed-Point Arithmetic 


One mode of computer arithmetic, fixed-point arithmetic, is quite similar to 
ordinary pencil-and-paper arithmetic. It deals with numbers expressed in the
familiar way as a sequence of digits (although these digits are almost always
binary digits rather than decimal ones). The crucial difference from ordinary
arithmetic is the assumption that the (binary or decimal) point is fixed,
usually at the left-hand end of the number, i.e., all fixed-point numbers less
than 1 in magnitude, but on some computers at the right-hand end, i.e., all
numbers integers. The problems with using fixed-point arithmetic in nontrivial
calculations are twofold:


1. Since the actual numbers one deals with are seldom so well-behaved as
to allow one to deal entirely with numbers less than 1 in magnitude (or 
with only integral numbers), the user of the computer, i.e., the program- 
mer, must arrange to keep track of an imaginary point which represents 
the correct position of the point. Particularly troublesome, for example, is 
the need for lining these imaginary points up correctly when adding two 
numbers. 

2. Equally annoying is the necessity for assuring that the result of any 
arithmetic operation is, from the point of view of the computer, a valid 
number in the fixed-point system. For example, if all numbers are less 
than 1 in magnitude, the sum .5 + .6 cannot be performed and is said to 
overflow. Protection against this phenomenon must be achieved by ap- 
propriately shifting the two arguments, a process called scaling, which can 
be extremely tedious in large calculations. 


For these reasons fixed-point arithmetic is only very rarely used in nontrivial 
calculations on computers. We shall consider it no further in this book but 
shall instead assume that all calculations are performed in floating-point 
arithmetic. 


1.5-2 Floating-Point Numbers


Numbers in a computer are typically of a standard length l, which includes
the sign of the number and its digits (but see also Sec. 1.5-5). In floating-point
arithmetic l consists of three parts:


One bit† s for the sign of the number
p bits for the fractional part, or mantissa, m
q (= l - p - 1) bits for the exponent e


as shown in Fig. 1.3. In this representation the number is interpreted by the
computer as

$$\pm m \times 2^{e-d} = \pm m \times 2^{f} \tag{1.5-1}$$

† Hereafter we shall assume a binary representation of numbers in a computer and shall use
the standard contraction bit for binary digit.

Figure 1.3 Floating-point number representation. Fields from left to right: sign,
exponent, mantissa, with the usual position of the binary point at the left end of
the mantissa. Length of number l = p + q + 1.


where d, the displacement, is chosen so that the exponents have a range from
$-d$ (e = 0) to $2^q - d - 1$ ($e = 2^q - 1$). To make this range approximately
symmetric about zero, d would normally be chosen equal to $2^{q-1}$, so that the
range of exponents would be from $-2^{q-1}$ to $2^{q-1} - 1$. In what follows we
shall replace e - d with f, which we shall assume has a set of possible integer
values approximately symmetric about 0, and, for convenience, we shall
ignore the adjustments to the exponents which must be made because of d.

In most floating-point calculations the mantissas are normalized, which
means that the most significant, i.e., the first, bit is a 1.† Thus, if the binary
point of the mantissa is assumed to be at its left end (in a few computers it is
assumed to be at the right end),

$$\tfrac12 \le |m| < 1 \tag{1.5-2}$$

In the analysis which follows we shall assume that floating-point numbers
have the form (1.5-1) and satisfy (1.5-2).

Before considering arithmetic in floating-point numbers we note one
property of such numbers which is often significant in numerical calculations
(see, for example, Example 2.2 in Sec. 2.2). This property is the density of
these numbers on the real line. Suppose p, the number of mantissa bits, is 24.
Then on the interval [0, 1) (exponent part f equals 0), there are $2^{24}$
floating-point numbers equally spaced $1/2^{24}$ apart. Similarly on any interval
$[2^k, 2^{k+1})$ there are $2^{24}$ equally spaced numbers, but their density is $2^{24}/2^k$.
Thus, for example, between $2^{20}$ = 1,048,576 and $2^{21}$ = 2,097,152 there are
$2^{24}$ = 16,777,216 floating-point numbers, but the spacing between successive
numbers is 1/16. The particular lesson to be learned from this is that when
comparing two nearby large floating-point numbers, e.g., in testing the
convergence of an iterative process, it is almost always advisable to compare
the difference relative to the magnitude of the numbers.
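The growth of the spacing with magnitude, and the resulting need for a relative test, can be seen on any current machine. The sketch below assumes Python 3.9 or later (for `math.ulp`) and IEEE double precision, which carries 53 mantissa bits rather than the 24 of the example, so the counts differ but the behavior is the same. The helper `relatively_close` and its tolerance are illustrative.

```python
import math

def relatively_close(a, b, rel_tol=1e-12):
    """Compare a and b relative to their magnitude (tolerance is illustrative)."""
    return abs(a - b) <= rel_tol * max(abs(a), abs(b))

# Spacing between adjacent doubles near 1 and near 2**20
print(math.ulp(1.0))        # 2**-52
print(math.ulp(2.0 ** 20))  # 2**-32, i.e. 2**20 times coarser

# Two adjacent large numbers: as close as the number system allows
a = 2.0 ** 60
b = a + math.ulp(a)
print(abs(a - b))               # 256.0: an absolute test with a small tolerance fails
print(relatively_close(a, b))   # the relative test succeeds
```

An absolute convergence test such as `abs(a - b) < 1e-6` could never be satisfied by distinct numbers of this size, which is the point of the advice above.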


† On many computers which are essentially binary, floating-point numbers are nevertheless
interpreted in hexadecimal (base 16). In this form (1.5-1) and (1.5-2) become

$$\pm m \times 16^{f} \qquad \tfrac{1}{16} \le |m| < 1$$


1.5-3 Floating-Point Arithmetic


We assume that the reader is sufficiently familiar with computers to under- 
stand why floating-point representation avoids the problems considered in 
the section on fixed-point numbers. In this section we shall analyze the 
errors inherent in arithmetic using floating-point numbers, first by consider- 
ing the four arithmetic operations and then with an example of a larger 
computation. 


The basic operations. We consider the result of operating on two l-digit
normalized floating-point numbers x and y to produce an l-digit result,
which we denote by

$$fl(x \text{ op } y)$$

where op is +, -, *, or /. We assume in each case that the mantissa of the
result is first normalized and then rounded.† The rounded value $m_r$ to p bits
of a mantissa m of more than p bits is defined as

$$
m_r =
\begin{cases}
2^{-p}\lfloor 2^{p} m + \tfrac12 \rfloor & m > 0 \\
2^{-p}\lceil 2^{p} m - \tfrac12 \rceil & m < 0
\end{cases}
\tag{1.5-3}
$$

where the floor function $\lfloor x \rfloor$ is the largest integer less than or equal to x and
the ceiling function $\lceil x \rceil$ is the smallest integer greater than or equal to x. For
positive numbers (1.5-3) expresses the familiar rule of adding 1 in the
(p + 1)st position. Considering the mantissa only, rounding therefore results
in an absolute error bounded by

$$|\epsilon| \le 2^{-p-1} \tag{1.5-4}$$

and a relative error bounded by $2^{-p-1}/\tfrac12 = 2^{-p}$.
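The rounding rule (1.5-3) is easy to try out with exact rational arithmetic. The following is a minimal sketch; the function name `round_mantissa` and the handling of m = 0 (which (1.5-3) does not cover) are assumptions made for illustration.

```python
import math
from fractions import Fraction

def round_mantissa(m, p):
    """Round the mantissa m to p bits following (1.5-3)."""
    m = Fraction(m)
    scaled = m * 2 ** p
    if m > 0:
        return Fraction(math.floor(scaled + Fraction(1, 2)), 2 ** p)
    if m < 0:
        return Fraction(math.ceil(scaled - Fraction(1, 2)), 2 ** p)
    return Fraction(0)  # m = 0 is not covered by (1.5-3); assumed to stay 0

p = 3
m = Fraction(3, 5)                 # 0.6, not representable in 3 bits
m_r = round_mantissa(m, p)         # nearest multiple of 1/8
print(m_r, abs(m_r - m) <= Fraction(1, 2 ** (p + 1)))
print(round_mantissa(-m, p))       # rounding is symmetric about zero
print(round_mantissa(Fraction(9, 16), p))  # a tie is rounded away from zero
```

With p = 3 the value 3/5 rounds to 5/8, and the absolute error 1/40 is within the bound $2^{-4}$ of (1.5-4).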

Now we consider briefly algorithms for the four arithmetic operations 
and the resultant relative errors. We begin with multiplication and division 
because in floating point they are much easier to understand than addition 
and subtraction. We assume the existence of a 2p-bit accumulator although 
in fact p + 2 bits are sufficient {19}. 


Multiplication The exponents are added, and the mantissas are multiplied. If 
the resulting mantissa is unnormalized, it is normalized and the exponent is 
adjusted. Then the mantissa is rounded to p bits. 

To analyze the error let

$$x = m_x 2^{f_x} \qquad y = m_y 2^{f_y} \tag{1.5-5}$$

Then

$$xy = m_x m_y \times 2^{f_x + f_y} \tag{1.5-6}$$

where

$$\tfrac14 \le |m_x m_y| < 1 \tag{1.5-7}$$

since $m_x$ and $m_y$ are both assumed to satisfy (1.5-2). Therefore, normalization
of $m_x m_y$ involves a left shift of at most 1. The rounded mantissa of $fl(x * y)$ is
then either

$$m_x m_y + \epsilon \quad \text{or} \quad 2 m_x m_y + \epsilon \tag{1.5-8}$$

with $\epsilon$ satisfying (1.5-4). Thus

$$
\begin{aligned}
fl(x * y) &=
\begin{cases}
(m_x m_y + \epsilon)\,2^{f_x + f_y} & |m_x m_y| \ge \tfrac12 \\
(2 m_x m_y + \epsilon)\,2^{f_x + f_y - 1} & \tfrac12 > |m_x m_y| \ge \tfrac14
\end{cases} \\
&= m_x m_y \times 2^{f_x + f_y} \times
\begin{cases}
1 + \dfrac{\epsilon}{m_x m_y} & |m_x m_y| \ge \tfrac12 \\
1 + \dfrac{\epsilon}{2 m_x m_y} & \tfrac12 > |m_x m_y| \ge \tfrac14
\end{cases} \\
&= x * y\,(1 + \epsilon_m)
\end{aligned}
\tag{1.5-9}
$$

where

$$|\epsilon_m| \le 2|\epsilon| \le 2^{-p} \tag{1.5-10}$$

Our result is that the computed product is the exact product times a factor
$1 + \epsilon_m$, with $\epsilon_m$ bounded as in (1.5-10). Thus the bound on the relative error in
multiplication is the same as the bound on the relative error in rounding a
number.

† The rounding may cause overflow, which then requires a renormalization.


Division Here the exponents are subtracted; one-half the numerator mantissa
is divided by the denominator mantissa (to avoid quotients greater
than 1), and the result is normalized and then rounded. Proceeding as above
and assuming $y \ne 0$, we have

$$\frac{x}{y} = \frac{m_x/2}{m_y}\,2^{f_x - f_y + 1} \tag{1.5-11}$$

with

$$\tfrac14 < \left|\frac{m_x/2}{m_y}\right| < 1 \tag{1.5-12}$$

Then by an analysis precisely similar to that above

$$fl\!\left(\frac{x}{y}\right) = \frac{x}{y}\,(1 + \epsilon_d) \tag{1.5-13}$$

where

$$|\epsilon_d| \le 2^{-p} \tag{1.5-14}$$


Addition and subtraction Here the mantissa of the operand of smaller
magnitude, say y, is shifted $f_x - f_y$ places to the right, the resulting mantissas
are added (or subtracted), the result is normalized (by shifting right if
the result overflowed), $f_x$ is adjusted accordingly, and then the result is
rounded. Thus

$$x \pm y = (m_x \pm m_y \times 2^{f_y - f_x})\,2^{f_x} \tag{1.5-15}$$

where if the true sum or difference is nonzero,

$$2^{-p} \le |m_x \pm m_y \times 2^{f_y - f_x}| < 2 \tag{1.5-16}$$

Normalization consists of a left shift of t places, where

$$-1 \le t \le p - 1 \tag{1.5-17}$$

with -1 corresponding to a right shift of 1. In general t is such that

$$2^{-t-1} \le |m_x \pm m_y \times 2^{f_y - f_x}| < 2^{-t} \tag{1.5-18}$$

Then

$$
\begin{aligned}
fl(x \pm y) &= [(m_x \pm m_y \times 2^{f_y - f_x})\,2^{t} + \epsilon]\,2^{f_x - t} \\
&= (x \pm y)\left(1 + \frac{\epsilon}{(m_x \pm m_y \times 2^{f_y - f_x})\,2^{t}}\right) \\
&= (x \pm y)(1 + \epsilon_a)
\end{aligned}
\tag{1.5-19}
$$

where, using (1.5-18),

$$|\epsilon_a| \le 2|\epsilon| \le 2^{-p} \tag{1.5-20}$$


In all the floating-point arithmetic operations, therefore, the relative 
error in the result has a magnitude no greater than 1 in the least significant 
position of the mantissa. 
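The bound just stated can be observed on present-day hardware. The sketch below is an illustration under stated assumptions, not the book's model: it assumes Python floats are IEEE doubles rounded to nearest, so that p = 53 plays the role of the mantissa length, and it restricts the operands to [1, 2] so that no overflow or underflow occurs. Exact results are obtained with rational arithmetic.

```python
import operator
import random
from fractions import Fraction

P = 53  # mantissa length of an IEEE double, standing in for p in the text

def relative_error(op, x, y):
    """Exact relative error of the machine result of x op y."""
    exact = op(Fraction(x), Fraction(y))   # exact rational result
    computed = Fraction(op(x, y))          # rounded result from the hardware
    return abs((computed - exact) / exact)

rng = random.Random(0)
bound = Fraction(1, 2 ** P)
worst = Fraction(0)
for _ in range(2000):
    x = rng.uniform(1.0, 2.0)
    y = rng.uniform(1.0, 2.0)
    for op in (operator.add, operator.sub, operator.mul, operator.truediv):
        if op is operator.sub and x == y:
            continue  # exact zero result: relative error is undefined
        worst = max(worst, relative_error(op, x, y))

print(float(worst), float(bound))
```

In this experiment the largest observed relative error stays below $2^{-53}$ for all four operations.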

It is important to note that if, instead of our assumption of the existence
of a 2p-bit accumulator, there were only a p-bit accumulator, our results
would be quite different. In particular, instead of (1.5-19) and (1.5-20) we
would have {20}

$$fl(x \pm y) = (x \pm y)(1 + \epsilon_a) \tag{1.5-21}$$

with

$$|\epsilon_a| \le 3 \times 2^{-p} \tag{1.5-22}$$
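Error analysis of a floating-point computation. Consider the sum

$$\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n \tag{1.5-23}$$

Suppose first that we add successive terms and use the results of the previous
section. Define

$$
\begin{aligned}
s_1 &= x_1 \\
s_r &= fl(s_{r-1} + x_r) = (s_{r-1} + x_r)(1 + \epsilon_r) \qquad r = 2, \ldots, n
\end{aligned}
\tag{1.5-24}
$$

with $|\epsilon_r| \le 2^{-p}$. Then

$$
\begin{aligned}
s_n &= (s_{n-1} + x_n)(1 + \epsilon_n) \\
&= [(s_{n-2} + x_{n-1})(1 + \epsilon_{n-1}) + x_n](1 + \epsilon_n) \\
&\;\;\vdots \\
&= x_1(1 + \eta_1) + x_2(1 + \eta_2) + \cdots + x_n(1 + \eta_n)
\end{aligned}
\tag{1.5-25}
$$

where

$$
\begin{aligned}
1 + \eta_r &= (1 + \epsilon_r)(1 + \epsilon_{r+1}) \cdots (1 + \epsilon_n) \qquad r = 2, \ldots, n \\
\eta_1 &= \eta_2
\end{aligned}
\tag{1.5-26}
$$

Therefore

$$(1 - 2^{-p})^{n-r+1} \le 1 + \eta_r \le (1 + 2^{-p})^{n-r+1} \qquad r = 2, \ldots, n \tag{1.5-27}$$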


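and finally

$$\frac{fl\left(\sum_{i=1}^{n} x_i\right) - \sum_{i=1}^{n} x_i}{\sum_{i=1}^{n} x_i} = \frac{\sum_{i=1}^{n} x_i \eta_i}{\sum_{i=1}^{n} x_i} \tag{1.5-28}$$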


From this result we note the following: 


1. If the true sum is small relative to the $x_i$'s, the relative error can be large.
This is as we would expect since this case corresponds to subtractive
cancellation as the sum is developed.

2. Since the bounds on $\eta_i$ decrease with i, the smallest upper bound on the
relative error is obtained if the numbers are added in increasing order of
magnitude (although in practice this does not necessarily give the smallest
error {21}).
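The effect of the order of summation can be made visible with a deliberately extreme data set. This is a sketch assuming IEEE double precision; the data (one large term and many terms too small to register individually) are chosen for illustration, and `math.fsum` serves as a correctly rounded reference.

```python
import math

def running_sum(values):
    """Add the values left to right, rounding after every addition."""
    total = 0.0
    for v in values:
        total += v
    return total

# One term of size 1 and 10,000 terms each below half a unit in the last place of 1
values = [1.0] + [1e-16] * 10_000
reference = math.fsum(values)   # correctly rounded sum, about 1 + 1e-12

increasing = running_sum(sorted(values, key=abs))                # small terms first
decreasing = running_sum(sorted(values, key=abs, reverse=True))  # large term first

print(abs(increasing - reference), abs(decreasing - reference))
```

Adding the large term first loses every small term, so the result is exactly 1.0; adding in increasing order of magnitude lets the small terms accumulate before they meet the large one.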


Now suppose in computing (1.5-23) that the double-precision (2p-bit)
mantissas obtained at each stage are not rounded after normalization but
instead are used directly in the next stage. Assuming that the accumulator
can accommodate only 2p bits, the error in each addition using (1.5-22) with
p replaced by 2p is such that

$$s_n = x_1(1 + \hat\eta_1) + x_2(1 + \hat\eta_2) + \cdots + x_n(1 + \hat\eta_n) \tag{1.5-29}$$

with

$$
\begin{aligned}
(1 - 3 \times 2^{-2p})^{n-r+1} &\le 1 + \hat\eta_r \le (1 + 3 \times 2^{-2p})^{n-r+1} \qquad r = 2, \ldots, n \\
\hat\eta_1 &= \hat\eta_2
\end{aligned}
\tag{1.5-30}
$$

When we compare (1.5-27) and (1.5-30), the advantage of retaining the
double-precision mantissa is clear since the bound on $\hat\eta_r$ is far smaller than
that on $\eta_r$. Even if the double-precision $s_n$ in (1.5-29) is then rounded to
single precision, the result is only to multiply the right-hand side of (1.5-29)
by $1 + \eta$ with $\eta$ bounded by $2^{-p}$.

Whereas the calculation of sums like (1.5-23) is not very common, the
calculation of inner products

$$\sum_{i=1}^{n} x_i y_i = x_1 y_1 + \cdots + x_n y_n \tag{1.5-31}$$

is very common, particularly in numerical linear algebra. Reasoning similar
to that above implies the great desirability of using double-precision products
$x_i y_i$ in computing the sum (1.5-31) {22}.
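The benefit of double-precision accumulation of an inner product can be imitated in software. In the sketch below, single precision is simulated by rounding a Python double to the IEEE 32-bit format with the `struct` module; this is an assumption made for illustration and differs slightly from true single-precision hardware, since each result is rounded twice. The data and the function names are hypothetical.

```python
import random
import struct
from fractions import Fraction

def to_single(x):
    """Round a double to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def dot_single(xs, ys):
    """Inner product with every product and partial sum rounded to single."""
    total = 0.0
    for x, y in zip(xs, ys):
        total = to_single(total + to_single(x * y))
    return total

def dot_double_accumulate(xs, ys):
    """Inner product accumulated in double, rounded to single once at the end."""
    total = 0.0
    for x, y in zip(xs, ys):
        total += x * y
    return to_single(total)

rng = random.Random(1)
xs = [to_single(rng.random()) for _ in range(10_000)]
ys = [to_single(rng.random()) for _ in range(10_000)]

exact = sum(Fraction(x) * Fraction(y) for x, y in zip(xs, ys))
err_single = abs(Fraction(dot_single(xs, ys)) - exact)
err_double = abs(Fraction(dot_double_accumulate(xs, ys)) - exact)
print(float(err_single), float(err_double))
```

With double-precision accumulation the error is essentially that of the single final rounding, whereas rounding every partial sum typically leaves a noticeably larger error.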

More generally readers will have been well served by this section if they 
recognize that for significant floating-point computations a careful error 
analysis is both worthwhile and nontrivial. 


1.5-4 Overflow and Underflow 


The discussion above ignored the possibility that the result of any floating- 
point operation will not be representable in the floating-point representation 
scheme of the computer. But this can occur as the result of any arithmetic 
operation in floating point. 

The magnitude of the largest number which can be expressed in the form 
(1.5-1) is 


$$M \times 2^{F} \tag{1.5-32}$$

where F is the largest positive exponent (usually $2^{q-1} - 1$) and $M = 1 - 2^{-p}$
is a mantissa all of whose bits are 1. Floating-point overflow results when the
result of a floating-point operation has a magnitude greater than that in
(1.5-32). This can happen with any arithmetic operation. For example, with
q = 8 and therefore F = 128 - 1 = 127, multiplying $\frac12 \times 2^{70}$ and $\frac12 \times 2^{60}$
gives a result greater than (1.5-32). Similarly the difference of $\frac12 \times 2^{127}$ and
$-\frac12 \times 2^{127}$ also overflows.

Underflow results when the result of a floating-point operation is non- 
zero but too small to be expressed in the form (1.5-1). If numbers are 
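required to be normalized, the smallest number expressible in the form
(1.5-1) is $\frac12 \times 2^{-F}$, where $-F$ is the most negative exponent allowed (so that F
here denotes the magnitude of that exponent, not the largest exponent of (1.5-32)), usually
$-2^{q-1}$. With q = 8 this is -128. Then, for example, $\frac12 \times 2^{-80}$ divided by
$\frac12 \times 2^{50}$ is a number too small to be expressed. Similar examples could be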
given for the other three operations. If nonnormalized numbers are allowed, 
the smallest nonzero number in magnitude is smaller than in the normalized 
case but underflow is still possible with any arithmetic operation {23}. 

Overflow is almost always the result of an error in the computation. For 
underflow, however, it is usually sufficient to replace the result by 0 although 
there are exceptions to this. 
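Overflow and underflow are easy to provoke on a current machine. The sketch assumes Python floats are IEEE doubles, where an overflowing multiplication returns infinity rather than stopping the program, and where unnormalized (subnormal) numbers are allowed below the smallest normalized number.

```python
import math
import sys

largest = sys.float_info.max           # largest finite double
smallest_normal = sys.float_info.min   # smallest normalized positive double

overflowed = largest * 2.0                       # too large: becomes infinity
underflowed = smallest_normal * smallest_normal  # too small: replaced by 0
gradual = smallest_normal / 2.0                  # unnormalized but still nonzero

print(overflowed, underflowed, gradual)
print(math.isinf(overflowed), 0.0 < gradual < smallest_normal)
```

The product of two very small numbers is silently replaced by zero, which matches the usual treatment of underflow described above.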


1.5-5 Single- and Double-Precision Arithmetic 


Arithmetic performed on either fixed- or floating-point numbers of length l
is called single-precision arithmetic. The example in Sec. 1.5-3 indicated an
instance where it is desirable to retain intermediate values of length greater
than l. In some calculations it is desirable to be able to use numbers of length
2l throughout or in some significant part of the calculation. Such arithmetic
is called double-precision arithmetic, and the numbers are called double-precision
numbers.

Many computers on which the standard length of numbers for arith- 
metic is | have machine-language instructions for doing double-precision 
arithmetic at least on floating-point numbers. On computers without this 
facility the effect of double-precision arithmetic can be obtained by suitable 
programming. Occasionally triple- or even higher-precision arithmetic is
necessary to achieve a needed level of accuracy. Such higher precision is 
almost always accomplished by programming. 

The relationship of double-precision floating-point numbers to single- 
precision floating-point numbers varies somewhat from computer to 
computer. But, using our previous notation, the most usual arrangement is 
to have a (p + l)-bit mantissa and, as with single precision, a q-bit exponent.


1.6 ERROR ANALYSIS


The intuitive approach to error analysis is to start from the errors in the 
initial data and take into account the successive roundoff errors as the 
calculation proceeds in order to compute estimates of, or bounds on, 
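the error in the result. For example, if $x_1$, $x_2$, and $x_3$ may each have an error,
$e_1$, $e_2$, or $e_3$, bounded, respectively, by $\epsilon_1$, $\epsilon_2$, and $\epsilon_3$, then, using overbars to
denote true values,

$$
\begin{aligned}
x_1 + x_2 x_3 &= (\bar{x}_1 + e_1) + (\bar{x}_2 + e_2)(\bar{x}_3 + e_3) \\
&= \bar{x}_1 + \bar{x}_2 \bar{x}_3 + e_1 + \bar{x}_2 e_3 + \bar{x}_3 e_2 + e_2 e_3
\end{aligned}
\tag{1.6-1}
$$

and the error in the exact computed value is bounded by

$$\epsilon_1 + |\bar{x}_2|\,\epsilon_3 + |\bar{x}_3|\,\epsilon_2 + \epsilon_2 \epsilon_3 \tag{1.6-2}$$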
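The bound that follows from expanding (1.6-1) term by term can be evaluated and checked against actual perturbations. The sketch below uses exact rational arithmetic; the function names and the sample values are illustrative assumptions. Note that in the expansion the true value of $x_2$ multiplies the error in $x_3$ and vice versa.

```python
from fractions import Fraction
from itertools import product

def forward_error_bound(x2_true, x3_true, eps1, eps2, eps3):
    """Bound on the error of x1 + x2*x3 obtained by expanding (1.6-1)."""
    return eps1 + abs(x2_true) * eps3 + abs(x3_true) * eps2 + eps2 * eps3

def actual_error(x_true, e):
    """Error of x1 + x2*x3 when each true value x_true[i] is perturbed by e[i]."""
    x1, x2, x3 = (t + d for t, d in zip(x_true, e))
    return abs((x1 + x2 * x3) - (x_true[0] + x_true[1] * x_true[2]))

x_true = (Fraction(1), Fraction(2), Fraction(3))   # illustrative true values
eps = Fraction(1, 1000)                            # common bound on each error
bound = forward_error_bound(x_true[1], x_true[2], eps, eps, eps)

# Try every combination of extreme perturbations +eps or -eps
worst = max(actual_error(x_true, e) for e in product((eps, -eps), repeat=3))
print(bound, worst)
```

Here the worst combination of signs attains the bound exactly, while the other combinations fall below it, which illustrates why such bounds tend to be conservative.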


Using this technique, as in the previous section, we can compute absolute 
and relative error bounds for each arithmetic operation, and using them, we 
can (theoretically at least) compute error bounds for long sequences of 
calculations. There are two main drawbacks to this forward error-analysis 
procedure: 


1. The resulting bounds are often very conservative; i.e., the true error is
unlikely to be near the bound. We saw one example of this in Sec. 1.4-1. 
The result is that we may be forced to use estimates of the probable error 
rather than bounds on the error, thereby incurring an obvious risk. 

2. The analysis itself is often extremely difficult and/or tedious for complex 
calculations. 


Notwithstanding these drawbacks, forward error analysis can lead to useful 
results and has been responsible for some notable results in numerical
analysis. For many calculations, however, backward error analysis, which we
now consider, is to be preferred.


1.6-1 Backward Error Analysis 


The essence of backward error analysis is to take the result of a calculation 
and determine from it the range of initial data which could have given rise to 
it. Why is this a useful thing to do? One reason is that it is not unusual to 
determine that the range of initial data which could give rise to the result is 
comparable to the inherent errors in these data caused by observational 
errors or by the roundoff committed when exact (decimal) numbers are 
converted to inexact binary numbers on input to the computer. 

The analyses in Sec. 1.5-3 of floating-point operations were backward 
error analyses. For example: 


1. Equation (1.5-9) indicates that the floating-point product of two numbers
is the true product of the true x and a y whose relative difference from the
y actually used is $\epsilon_m$. Since the relative error committed by rounding an
exact y to p mantissa bits has a magnitude bounded by $2^{-p}$, we note that
the y which would give the calculated floating-point product as a true
product differs from the y used by no more than this roundoff-error
bound.

2. Equation (1.5-19) indicates that the floating-point sum of two numbers is
the true sum of two summands each of which has a relative difference
from the summands actually used of $\epsilon_a$, where, as above, the relative
difference is no more than what could be expected from rounding each
summand to p mantissa bits.

3. Equation (1.5-25) indicates that the floating-point sum of n numbers is
the true sum of n numbers each of which differs from the true summand $x_i$
by a relative difference of $\eta_i$. In this case, use of (1.5-27) indicates
that the relative difference between the summands which would give the
true results and those actually used may be substantially greater than a
single rounding error.


The reader might easily conclude that the last example, the only one of 
the three of a nontrivial calculation, is the most typical and that in general 
backward error analysis does not lead to results which show that the result 
obtained could have come from initial data differing from those actually 
used by only the input roundoff error in the data. But a look at (1.5-30) 
indicates that if the double-precision mantissas are retained, the factors 
which could make the calculated sum the true sum differ very little from the 
actual data used. That is, even after $s_n$ in (1.5-29) is rounded to single precision,
the error associated with $x_r$ is

$$(1 + \hat\eta_r)(1 + \epsilon) \tag{1.6-3}$$

with $\hat\eta_r$ given by (1.5-30) and $\epsilon$ bounded by $2^{-p}$. The $2^{-2p}$ term in $\hat\eta_r$ means
that the product in (1.6-3) is very close to $1 + \epsilon$ unless n is very large. And, of
course, the bound on $\epsilon$ is just that of the maximum magnitude of a single
roundoff error.

The use of backward error analysis together with such techniques as the 
use of double-precision arithmetic in certain portions of a calculation can 
lead to some very striking results which attest to the value of this type of 
error analysis. One such result will be developed in detail in Sec. 9.4. 


1.7 CONDITION AND STABILITY 


Instability is a phenomenon which results when individual roundoff errors 
propagate through a calculation with increasing effect. In this section we 
shall discuss this phenomenon and the related subject of the condition of a 
problem. 

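The system of differential equations

$$
\begin{aligned}
y_1' &= y_2 \\
y_2' &= y_1
\end{aligned}
\tag{1.7-1}
$$

has the general solution

$$
\begin{aligned}
y_1 &= a_1 e^{x} + a_2 e^{-x} \\
y_2 &= a_1 e^{x} - a_2 e^{-x}
\end{aligned}
\tag{1.7-2}
$$

With the initial conditions

$$y_1(0) = -y_2(0) = 1 \tag{1.7-3}$$

the constants $a_1$ and $a_2$ are determined to be

$$a_1 = 0 \quad \text{and} \quad a_2 = 1 \tag{1.7-4}$$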


Now suppose that Eqs. (1.7-1) with initial conditions (1.7-3) are solved
numerically by any method whatsoever whose aim is to compute $y_1$ and $y_2$
at a sequence of points $x_1, x_2, \ldots$. The effect of roundoff error is to compute a
numerical solution of (1.7-1) with initial conditions perturbed from those of
(1.7-3). But even the slightest perturbation of (1.7-3) will result in a nonzero
$a_1$. Since the $a_1$ terms in (1.7-2) are rising exponentials, any nonzero $a_1$, no
matter how small, will result in an $e^{x}$ term which dominates the $e^{-x}$ term for
sufficiently large x. Therefore, it is not possible to compute a solution to
(1.7-1) with initial conditions (1.7-3) which, for sufficiently large x, will not
result in an arbitrarily large error relative to the true solution. This problem
is therefore inherently unstable or, to use the more common term in numerical
analysis, ill-conditioned.
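The growth of the error can be computed from the general solution (1.7-2) without any numerical integration. As an illustration, suppose the second initial condition is perturbed to $y_2(0) = -1 + \delta$ (the perturbation $\delta$ and the function names are assumptions of this sketch). Solving for the constants then gives $a_1 = \delta/2$ and $a_2 = 1 - \delta/2$.

```python
import math

def perturbed_y1(x, delta):
    """y1(x) from (1.7-2) when y1(0) = 1 and y2(0) = -1 + delta."""
    a1 = delta / 2          # no longer zero once the data are perturbed
    a2 = 1 - delta / 2
    return a1 * math.exp(x) + a2 * math.exp(-x)

def relative_error(x, delta):
    """Relative deviation of the perturbed y1 from the true solution exp(-x)."""
    true = math.exp(-x)
    return abs(perturbed_y1(x, delta) - true) / true

delta = 1e-10
for x in (1.0, 5.0, 10.0, 20.0):
    print(x, relative_error(x, delta))
```

A perturbation of $10^{-10}$ in the data is harmless at x = 1 but produces a relative error of more than $10^{6}$ by x = 20.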

A second example of ill condition is one discussed at greater length in
Sec. 8.13. The polynomial

$$f(z) = (z + 1)(z + 2) \cdots (z + 20) \tag{1.7-5}$$

has zeros -1, -2, ..., -20. However, the polynomial

$$f(z) + 2^{-23} z^{19} \tag{1.7-6}$$

has five pairs of complex zeros (one of which is $-19.502 \pm 1.940i$), and
among its 10 real zeros one is -20.847. Thus, a change in one coefficient of
f(z) (which, by the way, is equal to 210) by $2^{-23} \approx 10^{-7}$ results in a large
change in the zeros. The roundoff in any numerical method for computing
the zeros of (1.7-5) (given, of course, in the form $z^{20} + a_1 z^{19} + \cdots$) has the
effect that the zeros computed are the true zeros of a polynomial with perturbed
coefficients. Thus, computing the zeros of (1.7-5) is an ill-conditioned
problem (although with multiple-precision arithmetic accurate zeros can
nevertheless be calculated). Although, of course, we know the zeros of (1.7-5)
without computing, it turns out (see Sec. 8.13) that high-degree polynomials
generally tend to be ill-conditioned with respect to computation of their zeros.
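The sensitivity of the zeros of (1.7-5) can be estimated to first order without computing any zeros. If the coefficient of $z^{19}$ is changed by a small amount a, differentiating $f(z) + a z^{19} = 0$ gives $dz/da = -z^{19}/f'(z)$, and at the zero $z = -k$ the derivative $f'(-k)$ is the product of the differences $j - k$ over all $j \ne k$. The sketch below uses exact integer arithmetic; the function names are illustrative.

```python
from fractions import Fraction
from math import factorial

def expand_wilkinson(n=20):
    """Integer coefficients of (z+1)(z+2)...(z+n), highest power first."""
    coeffs = [1]
    for k in range(1, n + 1):
        shifted = coeffs + [0]                  # coefficients of p(z) * z
        scaled = [0] + [k * c for c in coeffs]  # coefficients of p(z) * k
        coeffs = [a + b for a, b in zip(shifted, scaled)]
    return coeffs

def root_sensitivity(k, n=20):
    """|dz/da| at the zero z = -k when a*z**(n-1) is added to the polynomial."""
    return Fraction(k ** (n - 1), factorial(k - 1) * factorial(n - k))

coeffs = expand_wilkinson()
print(coeffs[1])                                 # coefficient of z**19
print(float(root_sensitivity(20)))               # sensitivity of the zero -20
print(float(root_sensitivity(20)) * 2.0 ** -23)  # first-order shift for a = 2**-23
print(float(root_sensitivity(1)))                # the zero -1 is insensitive
```

The first-order estimate of the shift of the zero -20 is larger than the spacing between the zeros, so it signals ill condition rather than predicting the actual displacement.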

A well-conditioned problem, therefore, is one in which small changes in 
the data of a problem result, in some sense, in small changes in the solution, 
and an ill-conditioned problem has the opposite property. These two 
examples should illustrate how important it is to know or to be able to 
estimate the condition of a problem before attempting a numerical solution 
of it. 

Condition is related to how sensitive the solution of a problem is to 
changes in its data. Stability, on the other hand, is concerned with the 
numerical method used to solve a problem, in particular with the behavior 
of the roundoff errors introduced in the numerical solution. For example, 
the problem 


$$y' = -y \qquad y(0) = 1 \tag{1.7-7}$$


has the solution $y = e^{-x}$ and is very well-conditioned in that a small change in
the initial condition [$y(0) = 1 + \epsilon$] results in a small change in the solution
[$y = (1 + \epsilon)e^{-x}$]. But there are numerical methods for solving this equation
(one such, Milne’s method, is discussed in Sec. 5.5-2) which give useless 
results for medium to large values of x because the roundoff errors have the 
effect of introducing a spurious rising-exponential solution which soon over- 
whelms the true, falling-exponential solution. Or, in other words, the round- 
off errors introduced at one stage of the computation propagate with 
increasing magnitude in later stages. Such a numerical method is said to be 
unstable although, as is the case with Milne’s method, it may be unstable for 
some problems but stable for others. 
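The phenomenon can be reproduced with a much simpler method than Milne's. The sketch below does not implement Milne's method; it substitutes the two-step explicit midpoint (leapfrog) rule $y_{n+1} = y_{n-1} + 2h f(y_n)$, which is known to show the same kind of behavior on (1.7-7). Its difference equation has one solution that mimics $e^{-x}$ and a second, spurious one that grows in magnitude and alternates in sign; any error, whether from truncation or roundoff, excites the spurious solution. The step size and function name are illustrative assumptions.

```python
import math

def leapfrog_decay(x_end, h=0.1):
    """Approximate y(x_end) for y' = -y, y(0) = 1 with the leapfrog rule.

    The second starting value is taken from the exact solution, so the
    only errors are those introduced by the method and by roundoff.
    Assumes x_end is at least one step h.
    """
    steps = round(x_end / h)
    prev, curr = 1.0, math.exp(-h)
    for _ in range(steps - 1):
        # y_{n+1} = y_{n-1} + 2h * f(y_n) with f(y) = -y
        prev, curr = curr, prev - 2.0 * h * curr
    return curr

for x in (1.0, 5.0, 10.0, 20.0):
    print(x, leapfrog_decay(x), math.exp(-x))
```

The computed values are acceptable for small x but are eventually swamped by the growing spurious solution, even though the problem itself is very well-conditioned.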

The distinction between the condition of a problem and the stability of a 
method is a most important one to understand. An ill-conditioned problem 
can be solved accurately, if this is possible at all, only by very careful calculation,
e.g., multiple-precision arithmetic, quite aside from the method used. A
well-conditioned problem can be solved accurately by any numerical 
method which is stable for that problem. A method which is unstable for 
a particular (say, well-conditioned) problem may give accurate results in the 
calculation—before the propagation of roundoff error has overwhelmed the 
true solution—but inevitably will give bad results if used long enough. 


BIBLIOGRAPHIC NOTES 


Most texts on numerical analysis contain an introductory chapter (or chapters) with material in 
some degree similar to this chapter. In particular we refer the reader to Dahlquist, Bjorck, and 
Anderson (1974) and Young and Gregory (1972). 

The best book available on floating-point arithmetic, roundoff-error analysis, and error 
analysis more generally is Wilkinson (1964). Knuth (1969) also has an extensive and excellent 
discussion of many of these matters. A recent general textbook with a good chapter on floating- 
point arithmetic is Shampine and Allen (1973). Hamming (1973) devotes a chapter to roundoff 
error. For a deeper understanding of the probabilistic theory underlying roundoff-error 
analysis, see Feller (1950). For more details on the algorithms by which computers perform 
arithmetic see Knuth (1969), Flores (1960, 1963), and Richards (1955).


PROBLEMS
Section 1.3 


1 Let all the numbers in the following calculations be correctly rounded to the number of
digits shown: (a) 1.1062 + .947; (b) 23.46 - 12.753; (c) (2.747)(6.83); (d) 8.473/.064. For each
calculation, determine the smallest interval in which the result, using true instead of rounded
values of the quantities, must lie.


2 Let $y_1$ = .9863 and $y_2$ = .0028 be correctly rounded approximations to $x_1$ and $x_2$,
respectively. Find the maximum magnitude of the difference between the calculated values of
$1/y_1$ and $1/y_2$ and the true values.


3 Use (1.3-5) to approximate the error incurred when correctly rounded numbers are used 
to compute (a) the product of n numbers; (b) the quotient of two numbers; (c) a power of a 
number where the power is known exactly; (d) a power of a number where the power is also in 
error. 

4 (a) Use the results of parts (a) and (b) of Prob. 3 to get estimates of the errors in the 
computations of Prob. 1c and d. Where there are discrepancies between these bounds and those 
calculated in Probs. 1 and 2, what causes them? 

(b) Repeat part (a) on the computations of Prob. 2. 

5 (a) Use the results of Prob. 3c and d to get estimates of the errors in (i) (6.45)'/*? and (ii) 
(8.47)-°4? if 34 is exact and all other numbers are correctly rounded as shown. 

(b) Suppose you wish the values of (i) cos 1.473, (ii) tan ' 2.621, (iii) In 1.471, (iv) e?°*? 
but in each case have a table at an interval of .01 in the argument. Use (1.3-5) to get an estimate 
of the error made by using the nearest values in the tables to the given arguments. 

(c) Suppose each of the values in part (b) is a correctly rounded value. Estimate the 
maximum error incurred using the nearest tabular values. 


6 (a) Derive the Cauchy-Schwarz inequality

\[ \Bigl(\sum_{i=1}^{n} u_i v_i\Bigr)^2 \le \Bigl(\sum_{i=1}^{n} u_i^2\Bigr)\Bigl(\sum_{i=1}^{n} v_i^2\Bigr) \]

(b) Use this to show that the Euclidean vector norm (1.3-6) satisfies property 3 for norms.
(c) Show that the $L_1$ and $L_\infty$ norms satisfy properties 1 to 3.
(d) Derive the Hölder inequality

\[ |u^T v| \le \|u\|_p \, \|v\|_q \]

where p, q are numbers greater than 1 such that 1/p + 1/q = 1.
(e) Use this to show that the $L_p$ norm (1.3-7) satisfies property 3 for norms for $1 < p < \infty$.
(f) Show that the weighted $L_p$ norm (1.3-9) satisfies properties 1 to 3 for norms.
(g) Show that

\[ \lim_{p \to \infty} \Bigl(\sum_{i=1}^{n} |v_i|^p\Bigr)^{1/p} = \max_{1 \le i \le n} |v_i| \]


7 (a) Show that (1.3-13) defines a matrix norm.
(b) Given a matrix A and vector x show that

\[ \|Ax\|_\infty = \max_i \Bigl|\sum_{j=1}^{n} a_{ij} x_j\Bigr| \]

(c) Thus derive that

\[ \|Ax\|_\infty \le \sum_{j=1}^{n} |a_{rj}| \, |x_j| \]

where r is that value of i which maximizes the sum in part (b).
(d) By choosing appropriate values of $x_j$, show that

\[ \|A\|_\infty = \max_i \sum_{j=1}^{n} |a_{ij}| \]

(e) Similarly show that

\[ \|A\|_1 = \max_j \sum_{i=1}^{n} |a_{ij}| \]

(f) Show that the matrix norm (1.3-18) is a vector norm in $n^2$ space for the vector
$(a_{11}, \ldots, a_{1n}, a_{21}, \ldots, a_{nn})$ so that properties 1 to 3 for norms hold.

(g) Show that $\|AB\|_E \le \|A\|_E \|B\|_E$ so that (1.3-18) is indeed a norm. [Ref.: Newman
(1962), p. 228.]

8 (a) By considering the vector y = x/||x||, show that (1.3-17) holds for any vector x. 

(b) Show that for any given vector norm, the value of the norm of any matrix A given by a
consistent norm is not less than the value given by the subordinate matrix norm.

(c) Use the Cauchy-Schwarz inequality to show that $\|Ax\|_2 \le \|A\|_E \|x\|_2$, so that the
Frobenius norm is consistent with the $L_2$ vector norm.

(d) The trace of a matrix, tr (A), is defined to be $\sum_{i=1}^{n} a_{ii}$ and can be shown to be equal to
$\sum_{i=1}^{n} \lambda_i$, where the $\lambda_i$ are the eigenvalues of A. Show that $\|A\|_E^2 = \operatorname{tr}(A^T A)$.

(e) Use this together with (1.3-22) to show that $\|A\|_E \le n^{1/2} \|A\|_2$.

9 Let $\lambda$ be an eigenvalue of A and $x \ne 0$ the corresponding eigenvector, so that $Ax = \lambda x$.
By taking norms of both sides of this equation and using (1.3-17), prove (1.3-21).


Section 1.4 


*10 (a) Let $\epsilon$ have a probability density p(x) on $(-\infty, \infty)$. Show that the probability that $\epsilon$
does not exceed x is given by

\[ P(x) = \int_{-\infty}^{x} p(x) \, dx \]

P(x) is called the probability distribution of $\epsilon$.
(b) Use this result to show that if $\epsilon_1$ and $\epsilon_2$ are independently distributed variables with
probability densities $p_1(x)$ and $p_2(x)$, the probability distribution of $\epsilon_1 + \epsilon_2$ is given by

\[ \iint_{s + t \le x} p_1(s) \, p_2(t) \, ds \, dt \]

(c) Manipulate this integral to show that the probability density of $\epsilon_1 + \epsilon_2$ is given by

\[ \int_{-\infty}^{\infty} p_1(x - t) \, p_2(t) \, dt \]


*11 (a) Use the result of the previous problem to verify Fig. 1.2. 
(b) Use this result to find the probability density of $\epsilon_1 + \epsilon_2 + \epsilon_3$ when each of these
variables is independently distributed as shown in Fig. 1.1. 
(c) Compare the results of (a) and (b) with the corresponding normal density functions 
given by (1.4-4) with n = 2 and 3 by plotting the graphs of the functions. 


12 The error function is defined by

\[ \operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_{0}^{x} e^{-t^2} \, dt \]

(a) Show that

\[ \lim_{x \to \infty} \operatorname{erf}(x) = 1 \]

Hint: Consider erf (x) erf (y) and use polar coordinates.
(b) Use this result and (1.4-4) to show that the normal distribution function corresponding
to the normal density function with zero mean and variance n/12 is given by
$\tfrac{1}{2} + \tfrac{1}{2} \operatorname{erf}[(6/n)^{1/2} x]$.

(c) Finally deduce that the probability that $\epsilon$ in (1.4-4) is less than x in magnitude is given
by $\operatorname{erf}[(6/n)^{1/2} |x|]$.

13 (a) Use a table of the error function to verify Eq. (1.4-5). 

(b) How large must n be so that the probable error is less than one-tenth the maximum 
error n/2? For this n, use part (c) of Prob. 12 to estimate the probability that the error is greater 
in magnitude than three-quarters of the maximum error. 


*14 Consider the iteration

\[ x_{i+1} = \alpha x_i + \beta \qquad x_0 = a \qquad \beta \ne 0 \text{ or } a(1 - \alpha) \]

(a) Show that $|\alpha| < 1$ in order for the iteration to converge, i.e., in order for $\lim_{i \to \infty} x_i$ to
exist.

(b) Let $\epsilon_i$ be the accumulated roundoff error in $x_i$, and let $\delta_i$ be the roundoff error
introduced in the calculation of $x_i$ from $x_{i-1}$ (with $\delta_0$ the roundoff error in $x_0$). Assume $\alpha$ and $\beta$
are known exactly and that all $x_i$ are correctly rounded to d decimals. Find a bound $\delta$ on $\delta_i$
assuming the arithmetic is performed as efficiently as possible.

(c) Use the iteration equation to derive a difference equation relating $\epsilon_i$, $\epsilon_{i-1}$, and $\delta_i$.

(d) Use this result to show by induction that

\[ \epsilon_i = \sum_{j=0}^{i} d_{ij} \delta_j \qquad \text{where } d_{ii} = 1, \quad d_{ij} = \alpha d_{i-1,j}, \quad i > j \ge 0 \]

(e) Thus, deduce that

\[ d_{ij} = \alpha^{i-j} \]

and therefore that

\[ |\epsilon_i| < \frac{\delta}{1 - |\alpha|} \]

(f) Use the results of parts (b), (d), and (e) and Eq. (1.4-7) to show that the variance of the
probability density of $\epsilon_i$ is given by

\[ \sigma^2 = \frac{10^{-2d}}{12} \, \frac{1 - \alpha^{2i+2}}{1 - \alpha^2} < \frac{10^{-2d}}{12} \, \frac{1}{1 - \alpha^2} \]

[Ref.: Henrici (1964), pp. 309-314.]


15 Use a Lagrangian multiplier to show that $\sigma^2$ in (1.4-7) is minimized subject to the
constraint (1.4-8) when all the $a_i$'s are equal.


Section 1.5 


16 For each of the four arithmetic operations derive bounds on rounded fixed-point
calculation of the form

\[ \mathrm{fl}(x \text{ op } y) = x \text{ op } y + \epsilon \]

with $\epsilon$ suitably bounded. Assume p-digit fixed-point numbers in the range [−1, 1] and no
overflow in the result.

17 Consider a floating-point representation (1.5-1) with p = 27, q = 8. Which of the 10 
million numbers 9,000,000.0, 9,000,000.1, ..., 9,999,999.9 have the same representations in this 
format? Why must it be true in general that in any floating-point system some real numbers
must have the same representation? 

18 Prove the following results about the floor and ceiling functions:

(a) $\lfloor x \rfloor = \lceil x \rceil$ if and only if x is an integer.

(b) $\lfloor x \rfloor + 1 = \lceil x \rceil$ if and only if x is not an integer.

(c) $-\lfloor x \rfloor = \lceil -x \rceil$ for all x.

*19 (a) Let the floating-point mantissa length p = 4. By considering the addition of 
.1000 x 2? and —.1001 x 2? show that a (2p + 1)-bit accumulator must be available in general 
to get the correctly rounded result when adding two floating-point numbers.

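(b) After $m_2$ has been shifted $f_1 - f_2$ places to the right in a (p + 2)-bit accumulator,
replace $m_2$ by

\[ \operatorname{sgn}(m_2) \, 2^{-p-2} \lfloor 2^{p+2} |m_2| \rfloor \qquad \text{if } \operatorname{sgn}(m_1) = \operatorname{sgn}(m_2) \]

\[ \operatorname{sgn}(m_2) \, 2^{-p-2} \lceil 2^{p+2} |m_2| \rceil \qquad \text{if } \operatorname{sgn}(m_1) \ne \operatorname{sgn}(m_2) \]

Show that this has no effect if $f_1 < f_2 + 3$ ($f_1 \ge f_2$).

(c) If $f_1 \ge f_2 + 3$, show that the replacement of part (b) results in $|m_1 + m_2| > \tfrac{1}{4}$ and thus
infer that the bit which governs the rounding is now correctly in the p + 1 or p + 2 position in
the accumulator.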

(d) Finally deduce that an appropriate application of the transformation in part (b) will
allow a (p + 2)-bit accumulator to be used to get the correctly rounded result in floating-point 
addition. 

(e) Show that even with a transformation like that in part (b), a (p + 1)-bit accumulator 
cannot give a correctly rounded result in all cases. [Ref.: Knuth (1969).] 

20 Consider floating-point addition with a p-bit accumulator on numbers with p-bit
mantissas and let $\epsilon = 2^{-p}$. Assume that when the sum of two mantissas overflows, the overflow
bit is retained and may be shifted right.

(a) Show that if no overflow occurs when the two mantissas (one of which may have been
shifted) are added, then

\[ \mathrm{fl}(x + y) = x(1 + \epsilon_1) + y(1 + \epsilon_2) \]

where $|\epsilon_1|$ and $|\epsilon_2|$ are both bounded by $\epsilon$.

(b) Show that if overflow occurs, the addition may incur two rounding errors bounded by,
respectively, $2^{b-1} 2^{-1-p}$ and $2^{b-1} 2^{-p}$, where b is the exponent of the computed sum.

(c) Thus deduce that in the case described in part (b)

\[ \mathrm{fl}(x + y) = (x + y)(1 + \epsilon_{12}) \qquad \text{where } |\epsilon_{12}| \le \tfrac{3}{2} \times 2^{-p} \]


[Ref.: Wilkinson (1964).] 

21 (a) Explain why, in practice, adding a sequence of floating-point numbers in increasing 
order of magnitude may not give the smallest error. 

(b) Suppose using an 8-digit accumulator and the algorithm (1.5-24) you wish to add 


\[ .1025 \times 10^{4} + (-.9123) \times 10^{3} + (-.9663) \times 10^{2} + (-.9315) \times 10^{1} \]


Show that these numbers can be added without incurring any rounding errors but that adding 
them in order of increasing magnitude results in some rounding errors. [Ref.: Wilkinson (1964).] 


22 Consider the calculation of the inner product

\[ s_n = \mathrm{fl}(x_1 y_1 + x_2 y_2 + \cdots + x_n y_n) \]

(a) Show that

\[ t_r = \mathrm{fl}(x_r y_r) \qquad r = 1, \ldots, n \]
\[ s_1 = t_1 \]
\[ s_r = \mathrm{fl}(s_{r-1} + t_r) \qquad r = 2, \ldots, n \]

is an algorithm to calculate $s_n$.

(b) From this algorithm deduce relations analogous to (1.5-24). 

(c) Finally deduce relations analogous to Eqs. (1.5-25) to (1.5-26) and (1.5-29) to (1.5-30). 
[Ref.: Wilkinson (1964).] 

23 (a) Using the notation of (1.5-1), what are the largest and smallest unnormalized
floating-point numbers in magnitude? 

(b) Show that underflow and overflow are possible with unnormalized numbers for all 
arithmetic operations. 


*24 Consider a double-precision arithmetic scheme where a 2d-digit number is held in two 
d-digit words. Suppose that the computer can add two d-digit numbers at a time and can 
recognize when overflow occurs. 

(a) Devise an algorithm for adding two such double-precision numbers assuming all 
numbers are positive. 

(b) If both halves of the double-precision number have a sign associated with them, and if 
the signs are required to be the same, modify this algorithm so that it can handle addition of 
positive and negative numbers so that the sum will also have the same sign in both halves. 

In both parts assume there is no overflow in the overall sum. 


Section 1.6 


25 Find a bound on the relative error in computing the quantity on the left-hand side of 
(1.6-1) if the calculation is performed using floating-point arithmetic. 


26 Perform a backward error analysis on

\[ p_n = \mathrm{fl}(x_1 x_2 \cdots x_n) \]

to show that the relative error in the product is of the same order of magnitude as the individual
roundoff errors made when the true values of $x_i$ are rounded to obtain the $x_i$'s inside the
computer.


Section 1.7 
27 (a) Consider the linear system 
2x + 6y = 8 
2x + 6.00001 y = 8.00001 


Is this problem well-conditioned or ill-conditioned? Why? 
(b) Consider 


2x + 6y = 8 
2x + 5.99999y = 8.00002 


Solve this system exactly. What does the solution tell you about your answer to part (a)? 
(c) Give a geometric interpretation of your answers to parts (a) and (b). 
28 Consider the differential equation

\[ y'' - y = 0 \]

(a) For what initial conditions

\[ y(0) = a \qquad y'(0) = b \]

is the problem of solving this equation stable?
(b) What is the relation of part (a) to (1.7-1)?


29 For the equation f(x) = 0 consider solving this equation by the algorithm

Guess $x_0$
$x_{i+1} = F(x_i)$

where F(x) is somehow derived from f(x). Suppose it can be shown that

\[ \alpha = \lim_{i \to \infty} x_i \]

exists and that $f(\alpha) = 0$. What can you say about the stability of such a method of solving
f(x) = 0?


CHAPTER TWO
APPROXIMATION AND ALGORITHMS


2.1 APPROXIMATION 


We have said that numerical analysis is concerned with the solution of
mathematical problems by arithmetic processes. Clearly, then, the need to 
approximate nonarithmetic quantities by arithmetic quantities—and to 
ascertain the errors associated with such approximations—lies at the heart 
of much of numerical analysis. In a given situation there will usually be 
several possible methods of obtaining the desired approximation. Which of 
these to choose depends upon which of various possible criteria are used to 
judge how efficacious a given approximation is. The following simple 
example will illustrate what these criteria are and how they affect the choice 
of an approximation. 

Suppose that we are given a value of x and we wish to calculate $\sqrt{x}$. The
following are among the possibilities open to us:


1. Use the classic method learned by most people in grammar school which
begins by pairing off digits on either side of the decimal point.
2. Look up x in a table of square roots and if x lies between two arguments
in the table, interpolate (see Chap. 3) to find $\sqrt{x}$.
3. Use any one of a number of iterative techniques (see Chap. 8) to compute
$\sqrt{x}$.


Our object here is not to decide which of these methods should be used 
but to discuss the considerations that must precede any such decision. We 
naturally assume that all the methods “work,” i.e., that they all lead to a
result which can reasonably be considered an approximation to $\sqrt{x}$. The
basic question we must answer is: What error can we tolerate in the result?

This question not only recognizes the importance of the errors, both 
truncation and roundoff, incurred in an approximation but, more subtly, 
implies the importance of being able to estimate or bound these errors. The 
latter is a consideration of first importance in choosing a method for the 
solution of a problem. Only when we have methods where it is possible to 
estimate or bound the error can we then try to compare these methods on 
the basis of the magnitudes of the errors to which they lead. Each of the 
above methods for computing the square root has an error which can be 
estimated or bounded and for which the truncation component, at least, can be 
made arbitrarily small by carrying the computation far enough, e.g., by 
computing as many decimal places as desired in the first method. Therefore, 
on the basis of error considerations, each is a reasonable candidate for 
computing the square root. 

The reader may have noted that the question above is ambiguous. What 
do we mean by error in the result? Absolute error or relative error? Do we 
wish to bound the error for all x in some interval, or shall we be satisfied 
with a small average error (where now “average” is ambiguous)? These 
queries need to be answered in practice, but our purpose here has been to 
point out that the primary aim of any approximation is to achieve some desired 
degree of accuracy and implicit in this aim is the assumption that the accur- 
acy can indeed be estimated. 

In one important sense our approach to numerical analysis in this book 
will be basically pragmatic; i.e., except for techniques of special theoretical 
interest, we shall concentrate on methods which are usable in practice. Thus, 
for a given method, we shall usually also wish to answer the question: How 
fast can a solution be computed using a given method? 

In the case of the first method above for the calculation of $\sqrt{x}$, this
question is easily answered since given the number of decimal places desired
in the square root, this is a finite process with the number and type of
calculations strictly determined. Using the second method, interpolation, we
must first choose an interpolation formula which will achieve the desired
accuracy. Having done this, however, the amount of calculation is again
strictly determined.† But using iterative processes, the situation is different.
Our assumption that the method would “work” is equivalent to assuming 
that the iteration converges. But the amount of computation required to 
achieve the desired degree of accuracy depends on the rate of convergence. 
Therefore, determining rates of convergence in iterative processes will 
always be of importance to us. 
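To make the dependence on the rate of convergence concrete, here is a minimal sketch of one iterative technique for $\sqrt{x}$. The text does not name a specific iteration at this point, so the choice of Newton's iteration $y \leftarrow (y + x/y)/2$, the starting guess, the relative stopping test, and the names `newton_sqrt`, `tol`, and `max_iter` are all assumptions made for illustration.

```python
def newton_sqrt(x, tol=1e-12, max_iter=100):
    """Approximate sqrt(x) for x > 0; return (estimate, iterations used).

    Illustrative assumption: Newton's iteration y <- (y + x/y)/2.
    """
    if x <= 0:
        raise ValueError("x must be positive")
    # Starting guess: any value >= sqrt(x) works; this choice is arbitrary.
    y = x if x >= 1 else 1.0
    for i in range(1, max_iter + 1):
        y_next = 0.5 * (y + x / y)
        # Stop when successive iterates agree to a relative tolerance.
        if abs(y_next - y) <= tol * y_next:
            return y_next, i
        y = y_next
    return y, max_iter


for x in (2.0, 100.0, 1.0e6):
    root, steps = newton_sqrt(x)
    print(x, root, steps)
```

Unlike the digit-by-digit and table methods, the number of steps is not fixed in advance: it depends on the starting guess and on how quickly the iterates approach the root, which is why the count grows with x under this particular starting rule.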

Generalizing then from our example of $\sqrt{x}$, we conclude that the primary
aim of an approximation is to achieve some desired degree of accuracy
and to be such that the accuracy can be estimated. Second, we are also
interested in the amount of computation required to achieve the approximation.
With these heuristic notions behind us, we now proceed to consider
the general problem of approximation.

† This is really a simplification of the truth; see Sec. 3.5.


2.1-1 Classes of Approximating Functions 


Much of the approximation done in numerical analysis consists of approxi- 
mating a function f(x) by some combination—most often a linear 
combination—of functions drawn from some particular class of functions. 
The most familiar example of this is the approximation of f(x) by the first N 
terms of its Taylor-series expansion. Another familiar example occurs in the 
trapezoidal rule, in which f(x) is approximated by a sequence of straight 
lines. In the former example and for each straight line in the trapezoidal rule, 
the approximation is a linear combination of functions from the class $\{x^n\}$,
$n = 0, 1, \ldots$. More generally, we may consider the class $\{p_n(x)\}$, where $p_n(x)$ is
a polynomial of degree n. Another class which is suggested by the importance
of periodic functions is the class of Fourier functions {sin nx, cos nx},
$n = 0, 1, \ldots$. There are, of course, a number of other classes of functions
which would lead to useful approximations in particular cases. Especially 
worth mentioning are: 


Rational functions, which will play an important role in Chap. 7 and which 
can be used, whereas polynomials cannot, to approximate functions 
with poles. 

Piecewise polynomial functions, i.e., functions which are different polyno- 
mials on different subintervals, which will be used in Sec. 3.8. 

Exponential functions. 


But for general application, polynomials and the Fourier functions are by far 
the most important, with the former predominating. Since this assertion 
about polynomial approximations is basic to our study of numerical analysis, 
in the next few pages we shall attempt to justify it. 


2.1-2 Types of Approximations 


Let f(x) be a function which we wish to approximate using the class of 
functions $\{g_n(x)\}$, $n = 0, 1, \ldots$. Suppose we approximate f(x) by the linear
combination

\[ f(x) \approx a_0 g_0(x) + a_1 g_1(x) + \cdots + a_m g_m(x) \tag{2.1-1} \]


where the $a_i$, $i = 0, 1, \ldots, m$, are constants. We shall call (2.1-1) an approximation
of linear type to f(x). Because the analysis of approximations involving
nonlinear combinations of the approximating functions, like most
nonlinear analysis, is very difficult, we shall be concerned almost entirely 
with approximations of linear type. In Chap. 7, however, we shall make 
extensive use of approximations of rational type which have the form 


\[ f(x) \approx \frac{a_0 g_0(x) + a_1 g_1(x) + \cdots + a_m g_m(x)}{b_0 g_0(x) + b_1 g_1(x) + \cdots + b_k g_k(x)} \tag{2.1-2} \]


The crux of the approximation problem is the criterion to use in choos- 
ing the constants in (2.1-1) and (2.1-2). Three methods of doing this lead to 
three types of approximations of major importance: 


1. Exact or interpolatory approximations, in which the constants are chosen 
so that on some fixed set of points $x_i$, $i = 1, \ldots, p$, the approximation and
its first $r_i$ derivatives (where $r_i$ is a nonnegative integer) agree with f(x)
(except for roundoff). 

2. Least-squares approximations, in which the object is to minimize the 
integral of the square of the difference between f(x) and its approxima- 
tion (perhaps multiplied by a suitable weighting function) over an inter- 
val [a, b] or, more commonly, to minimize the weighted sum of the 
squares of the error over a discrete set of points of [a, b]. In the language 
of Sec. 1.3-3 we wish to minimize the $L_2$ norm.

3. Minimum maximum error approximations, where the aim is to minimize 
the maximum magnitude of the difference between f(x) and its approxi- 
mation (again perhaps suitably weighted) on an interval [a, b]. In the 
terminology of Sec. 1.3-3, we wish to minimize the $L_\infty$, or uniform,
norm.


These heuristic definitions have been given here in order to orient the 
reader to what follows. They will be made properly precise in later chapters. 
Of the three types, exact approximations are generally easier to derive and 
analyze (i.e., errors can be more easily estimated) than the other two. Chapters
3 to 5 will be exclusively concerned with exact approximations. Because
of their particular advantages in certain applications, we shall discuss
least-squares and minimum maximum error approximations in Chaps. 6 and 7,
respectively. 
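As a small illustration of how the choice of criterion changes the constants in (2.1-1), the sketch below fits a straight line $c_0 + c_1 x$ to $e^x$ on [0, 1] in two ways: by interpolation at the endpoints and by discrete least squares on a uniform grid. The target function, the grid of 101 points, and all function names are assumptions chosen for the example; the minimum maximum error line is not computed here.

```python
import math


def linear_interpolant(f, a, b):
    """Line c0 + c1*x that agrees with f at x = a and x = b."""
    c1 = (f(b) - f(a)) / (b - a)
    return f(a) - c1 * a, c1


def linear_least_squares(f, xs):
    """Line c0 + c1*x minimizing the sum of squared errors over the points xs."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(f(x) for x in xs)
    sxy = sum(x * f(x) for x in xs)
    # Solve the 2-by-2 normal equations directly.
    c1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return (sy - c1 * sx) / n, c1


def errors(f, coeffs, xs):
    """Return (sum of squared errors, maximum absolute error) over xs."""
    c0, c1 = coeffs
    residuals = [f(x) - (c0 + c1 * x) for x in xs]
    return sum(r * r for r in residuals), max(abs(r) for r in residuals)


xs = [i / 100 for i in range(101)]
interp = linear_interpolant(math.exp, 0.0, 1.0)
lsq = linear_least_squares(math.exp, xs)
print("interpolant   (sum sq, max):", errors(math.exp, interp, xs))
print("least squares (sum sq, max):", errors(math.exp, lsq, xs))
```

The interpolant has zero error at the two chosen points, whereas the least-squares line gives up exact agreement anywhere in exchange for a smaller sum of squared errors over the grid.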


2.1-3 The Case for Polynomial Approximation 


By a polynomial approximation we mean one of the form (2.1-1), where each 
$g_i(x)$ is a polynomial. The computational case for the use of polynomials as
the approximating functions follows directly from the fact that a digital 
computer can perform only the computational operations of arithmetic. A 
piecewise rational function is then the most general kind of function that can 
be evaluated directly on a digital computer.† Thus, in approximating a function
f(x) using any other class of functions $\{g_n(x)\}$, we must first evaluate the
functions $g_n(x)$ using some approximation of each $g_n(x)$ as a polynomial in x
or rational function of x. As a word of caution, though, it is easy to overemphasize
this advantage of the class $\{x^n\}$. Thus, for example, in Chap. 6 we
shall indicate how Fourier approximations can be calculated with only a
single evaluation of a sine and a cosine.

One property that the powers of x and the trigonometric functions (as 
well as exponential functions) have in common is that an approximation 
using either of these classes changes its coefficients but not its form if the 
origin of the coordinate system is changed. Thus, if P(x) is a polynomial or 
rational function, so is $P(x + \alpha)$, and if T(x) is a linear or rational approximation
using sines and cosines, so is $T(x + \alpha)$. Approximations using
powers of x have the further advantage that if the scale of the variable is
changed, again the coefficients but not the form of the approximation are 
changed. Thus P(kx) is still a polynomial in x. But this property does not 
hold for approximations using sines and cosines. That is, in general, for 
noninteger k, sin nkx is not a member of the class {sin nx}. Finally we note 
that if P(x) and Q(x) are polynomials, so is P(Q(x)) and if R(x) and S(x) are 
rational functions, so is R(S(x)). 
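The closure of polynomials under a change of origin and a change of scale can be checked directly on coefficient lists. The following is a minimal sketch; the representation (coefficients stored lowest degree first) and the helper names `shift`, `scale`, and `evaluate` are assumptions made for the example.

```python
from math import comb


def shift(coeffs, alpha):
    """Coefficients (lowest degree first) of P(x + alpha), given those of P(x)."""
    out = [0.0] * len(coeffs)
    for k, a in enumerate(coeffs):
        # a*(x + alpha)^k = a * sum_j C(k, j) * alpha^(k - j) * x^j
        for j in range(k + 1):
            out[j] += a * comb(k, j) * alpha ** (k - j)
    return out


def scale(coeffs, k):
    """Coefficients of P(k*x): the i-th coefficient is multiplied by k^i."""
    return [a * k ** i for i, a in enumerate(coeffs)]


def evaluate(coeffs, x):
    """Horner evaluation, coefficients lowest degree first."""
    result = 0.0
    for a in reversed(coeffs):
        result = result * x + a
    return result


p = [1.0, -3.0, 0.0, 2.0]  # 1 - 3x + 2x^3
print(shift(p, 0.5))       # same degree, different coefficients
print(scale(p, 2.0))
```

In both cases the result is again a coefficient list of the same length, which is the sense in which the form of the approximation is unchanged while its coefficients change.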

One further obvious analytic advantage of polynomials is the ease with 
which they can be manipulated in general and differentiated and integrated 
in particular. In classical analysis, this enables us, for example, to express the 
remainder term in a Taylor series in closed form, and, as we shall see, there 
are analogous advantages in numerical analysis. A similar analytic advan- 
tage is also possessed by the Fourier functions. 

All the advantages of the class $\{x^n\}$ that we have mentioned would be for
naught if there were no analytic basis for our hope that we can achieve
arbitrarily high accuracy with this class. We assume the reader is familiar
with the result [Courant and Hilbert (1953), p. 65] that the set of functions
$\{x^n\}$ is complete over any interval [a, b]; that is, for any piecewise continuous
function f(x), given any $\epsilon > 0$, there exists an n and coefficients $a_0, \ldots, a_n$
such that

\[ \int_a^b \Bigl[f(x) - \sum_{i=0}^{n} a_i x^i\Bigr]^2 dx < \epsilon \tag{2.1-3} \]

Since sines and cosines also form a complete set, there is a result analogous
to (2.1-3) for them. This result assures us that we can achieve arbitrarily
good least-squares approximations using linear combinations of polynomials.
To show that we can achieve arbitrarily small minimum maximum
error with linear combinations of polynomials, we shall now prove the
classical theorem of Weierstrass on polynomial approximation. As a corollary
of this theorem, we shall then obtain a similar result for Fourier
approximations.

† This is a slight exaggeration. More general functions can be evaluated using the logical
operations or operations on the magnitude of a number which are available on all computers.
But for all practical purposes, the statement above is true.


The Weierstrass Approximation Theorem 


Theorem 2.1 (Weierstrass) If f(x) is continuous on a finite interval
[a, b], then, given any $\epsilon > 0$, there exists an n [$= n(\epsilon)$] and a polynomial
$P_n(x)$ of degree n such that

\[ |f(x) - P_n(x)| < \epsilon \tag{2.1-4} \]

for all x in [a, b].


Thus, by requiring continuity instead of piecewise continuity, we achieve 
uniform approximation instead of approximation in the mean as in (2.1-3). 


PROOF Our proof of this theorem is due to Bernstein (1912). It is
achieved by constructing a sequence of polynomials which converges
uniformly to f(x). Without loss of generality, we let a = 0, b = 1 since
any other interval can be reduced to [0, 1] by a simple change of variable
{11}. On this interval, the Bernstein polynomial of degree n is defined by

\[ B_n(x) = \sum_{k=0}^{n} \binom{n}{k} x^k (1 - x)^{n-k} f\Bigl(\frac{k}{n}\Bigr) \tag{2.1-5} \]

We shall show that

\[ \lim_{n \to \infty} B_n(x) = f(x) \tag{2.1-6} \]

uniformly in [0, 1]. To prove this theorem, we require the following
lemma.


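Lemma 2.1 The following identities are true:

\[ \sum_{k=0}^{n} \binom{n}{k} x^k (1 - x)^{n-k} = 1 \]

\[ \sum_{k=0}^{n} \frac{k}{n} \binom{n}{k} x^k (1 - x)^{n-k} = x \tag{2.1-7} \]

\[ \sum_{k=0}^{n} \frac{k^2}{n^2} \binom{n}{k} x^k (1 - x)^{n-k} = \Bigl(1 - \frac{1}{n}\Bigr) x^2 + \frac{1}{n} x \]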


PROOF To derive all three identities, we use the binomial expansion

\[ (p + q)^n = \sum_{k=0}^{n} \binom{n}{k} p^k q^{n-k} \tag{2.1-8} \]

The first identity follows immediately when p + q = 1. The other two
follow by differentiating (2.1-8), respectively, once and twice with respect
to p and then setting p + q = 1 {12}.
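The three identities of Lemma 2.1 are easy to check numerically for particular values. The sketch below does so for one arbitrarily chosen pair n = 10, x = 0.3; the function name `bernstein_moments` and these test values are assumptions made for the example, and a numerical check of this kind is of course no substitute for the proof.

```python
from math import comb


def bernstein_moments(n, x):
    """Return the three sums on the left-hand sides of (2.1-7) for given n and x."""
    # w[k] = C(n, k) * x^k * (1 - x)^(n - k)
    w = [comb(n, k) * x ** k * (1 - x) ** (n - k) for k in range(n + 1)]
    s0 = sum(w)
    s1 = sum((k / n) * wk for k, wk in enumerate(w))
    s2 = sum((k / n) ** 2 * wk for k, wk in enumerate(w))
    return s0, s1, s2


n, x = 10, 0.3
s0, s1, s2 = bernstein_moments(n, x)
print(s0, "should be close to", 1.0)
print(s1, "should be close to", x)
print(s2, "should be close to", (1 - 1 / n) * x ** 2 + x / n)
```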

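Combining the identities in (2.1-7), we have

\[ \sum_{k=0}^{n} \Bigl(\frac{k}{n} - x\Bigr)^2 \binom{n}{k} x^k (1 - x)^{n-k} = \frac{x(1 - x)}{n} \tag{2.1-9} \]

Multiplying the first identity in (2.1-7) by f(x) and subtracting $B_n(x)$, we
have

\[ f(x) - B_n(x) = \sum_{k=0}^{n} \Bigl[f(x) - f\Bigl(\frac{k}{n}\Bigr)\Bigr] \binom{n}{k} x^k (1 - x)^{n-k} \tag{2.1-10} \]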


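Since f(x) is continuous on [0, 1], it is both uniformly continuous and
bounded. Therefore, there exist $\delta$ and M such that

\[ |f(x_1) - f(x_2)| < \frac{\epsilon}{2} \quad \text{if } |x_1 - x_2| < \delta; \quad x_1, x_2 \in [0, 1] \]
\[ |f(x)| \le M \quad x \in [0, 1] \tag{2.1-11} \]

where $\epsilon$ is as given in the statement of the theorem.

For any x, we divide the points k/n, k = 0, ..., n, into two sets A and
B such that

\[ \frac{k}{n} \text{ is } \begin{cases} \text{in } A & \text{if } \Bigl|\dfrac{k}{n} - x\Bigr| < \delta \\ \text{in } B & \text{otherwise} \end{cases} \]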

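Then, using the first of the relations (2.1-11) and the first of the identities
(2.1-7), we have

\[ \Bigl|\sum_{k/n \in A} \Bigl[f(x) - f\Bigl(\frac{k}{n}\Bigr)\Bigr] \binom{n}{k} x^k (1 - x)^{n-k}\Bigr| \le \frac{\epsilon}{2} \sum_{k/n \in A} \binom{n}{k} x^k (1 - x)^{n-k} \le \frac{\epsilon}{2} \tag{2.1-12} \]

Using the second of the relations (2.1-11) and (2.1-9), we have

\[ \Bigl|\sum_{k/n \in B} \Bigl[f(x) - f\Bigl(\frac{k}{n}\Bigr)\Bigr] \binom{n}{k} x^k (1 - x)^{n-k}\Bigr| \le 2M \sum_{k/n \in B} \binom{n}{k} x^k (1 - x)^{n-k} \]
\[ \le \frac{2M}{\delta^2} \sum_{k/n \in B} \Bigl(\frac{k}{n} - x\Bigr)^2 \binom{n}{k} x^k (1 - x)^{n-k} \le \frac{2M x(1 - x)}{n \delta^2} \le \frac{M}{2 n \delta^2} \tag{2.1-13} \]

since $0 \le x(1 - x) \le \tfrac{1}{4}$ on [0, 1]. Now if we choose n so that

\[ \frac{M}{2 n \delta^2} < \frac{\epsilon}{2} \tag{2.1-14} \]

and combine (2.1-12) and (2.1-13) in (2.1-10), the theorem is proved by
identifying $P_n(x)$ with $B_n(x)$.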


For periodic functions, we have an analogous theorem. 


Theorem 2.2 (Weierstrass) If F(t) is a periodic continuous function of
period $2\pi$, then, given any $\epsilon > 0$, there exists an n [$= n(\epsilon)$] and a
trigonometric sum

\[ S_n(t) = a_0 + \sum_{k=1}^{n} (a_k \cos kt + b_k \sin kt) \tag{2.1-15} \]

such that

\[ |F(t) - S_n(t)| < \epsilon \tag{2.1-16} \]

for all t.


Using Theorem 2.1, this theorem is not hard to prove; we leave the
proof to a problem {13}.

In a sense, then, Theorem 2.1 justifies the use of polynomial approximations
since it guarantees that a polynomial can be found with an arbitrarily
small maximum deviation from f(x) on [a, b]. In fact, since the proof is 
constructive, the reader may well think that we have solved the problem of 
minimum maximum error approximations. Furthermore, our reasons for 
preferring exact approximations—ease of derivation and error analysis— 
may seem to be weak in the light of Theorem 2.1. But unfortunately we have 
neither solved the minimum maximum error problem nor really weakened 
the case for exact approximation. The reasons for this are as follows: 


1. In deriving minimum maximum error approximations, we shall be concerned
with finding, for example, that polynomial of degree n which has
the minimum maximum error as an approximation to f(x). The Bernstein
polynomial of degree n is by no means this polynomial.

2. The usual situation with exact approximations is that we are given only a
sequence of points $x_i$ and the corresponding $f(x_i)$, that is, a table. But,
using Bernstein polynomials, we are forced to use particular values of f(x)
which may not be available. Moreover, the Bernstein polynomial of
degree n is no help in deriving the polynomial of degree n + 1. This is a
serious drawback, as will be made clearer in Chap. 3.

3. The ease with which it appears possible to bound the error in a Bernstein
polynomial is a mirage, because the error bound tends to be extremely
conservative. The following example will illustrate this.
Example 2.1 What degree Bernstein polynomial will guarantee an error of less than .2 in
approximating $e^x$ on [0, 1]?
We have the inequality on [0, 1]

\[ |e^{x_1} - e^{x_2}| \le e \, |x_1 - x_2| \]

which follows from the mean-value theorem. Therefore, with $\epsilon = .2$, taking $\delta < .2/2e$
satisfies the first relation of (2.1-11). Since M = e, in order to satisfy (2.1-14) we need

\[ n > \frac{M}{\epsilon \delta^2} = \frac{e^3}{2 \times 10^{-3}} > 10{,}000 \]

However, if we compute the Bernstein polynomial of degree 2, we find

\[ B_2(x) = (1 - x)^2 + 2x(1 - x) e^{1/2} + x^2 e \]

and by direct calculation

\[ |e^x - B_2(x)| < .11 \]

on [0, 1]. Therefore, the estimate resulting from the proof of Theorem 2.1 is extremely
conservative. Moreover, using the techniques of Chap. 7, we can achieve a substantially
smaller maximum error than .11 with a quadratic approximation to $e^x$ on [0, 1].
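The comparison in Example 2.1 can be reproduced with a short computation. The sketch below evaluates the Bernstein polynomial (2.1-5) for $e^x$ and estimates its maximum error by sampling [0, 1]; the grid of 1001 sample points, the particular degrees tried, and the names `bernstein` and `max_error` are assumptions made for the example, and a sampled maximum only approximates the true maximum.

```python
from math import comb, e, exp


def bernstein(f, n, x):
    """Evaluate the Bernstein polynomial B_n(x) of f on [0, 1], as in (2.1-5)."""
    return sum(comb(n, k) * x ** k * (1 - x) ** (n - k) * f(k / n)
               for k in range(n + 1))


def max_error(f, n, grid=1000):
    """Largest |f(x) - B_n(x)| over grid + 1 equally spaced points of [0, 1]."""
    return max(abs(f(i / grid) - bernstein(f, n, i / grid))
               for i in range(grid + 1))


# Degree demanded by the bound from the proof, with the values of Example 2.1.
eps = 0.2
delta = eps / (2 * e)
M = e
print("degree required by the bound:", M / (eps * delta ** 2))

# Sampled maximum errors for a few small degrees.
for n in (2, 4, 8, 16, 32):
    print(n, max_error(exp, n))
```

The first printed value is the degree the proof's estimate asks for, while the sampled errors show how small a degree already meets the .2 target in practice.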


Our conclusion, then, is that while the Bernstein polynomials lead to an 
elegant constructive proof of the Weierstrass theorem, they do not in them- 
selves generally give useful polynomial approximations. 

The assurance that Theorem 2.1 gives us that we can find polynomial 
approximations which have arbitrarily small error, combined with the com- 
putational and other analytic advantages mentioned previously in this 
section, make a strong case for polynomial approximations. Their domin- 
ance in the next four chapters will not surprise the reader. 

The case for polynomial approximation, however, is not so good as to
rule out all other types. In Chap. 7 our approximating functions will always 
be polynomials, but we shall make extensive use of rational approximations 
of the form (2.1-2). Approximations using functions other than polynomials 
are of considerable importance also. We have already noted the importance 
of periodic functions and the fact that the Fourier functions satisfy results 
analogous to those discussed in this section for polynomials. 

The importance of band-limited functions, functions whose constituent 
periodicities are known to be bounded, in many servomechanism and 
related problems is another indication of the importance of approximations 
using the Fourier functions. We shall touch on such approximations in 
Sec. 6.6, but for a more extensive treatment we refer the reader to Hamming 
(1973). 


2.2 NUMERICAL ALGORITHMS 


If numerical analysis is concerned with the processes by which mathematical 
problems can be solved by the operations of arithmetic, then the practice 
of numerical analysis requires that a problem statement be turned into a 
sequence of arithmetic operations which convert the data of the problem into
the results. The sequence of arithmetic operations and the set of decisions 
which indicate which operation in the sequence to perform next constitute 
a rule, a recipe for solving the given problem. Such a rule in mathematics 
and computer science is called an algorithm. 

One of the most active areas of research in computer-related mathema- 
tics is the analysis of algorithms. This subject treats such matters as the 
speed of convergence of algorithms, the accuracy of algorithms, and the 
relationship of one algorithm for solving a particular problem to another for 
the same problem. Strictly speaking, as a mathematical science, numerical
analysis is concerned with 


The development of algorithms 
The analysis of algorithms 
The computer implementation of algorithms 


In this section we shall illustrate the basic characteristics of algorithms 
by an example of a simple but effective algorithm. 


Example 2.2 
Bisection 
Given: $f(x)$ continuous on $[a, b]$ and such that 
    $f(a) f(b) < 0$ 
Problem: Find a point $\alpha$ such that 
    $f(\alpha) = 0$ 


We define an iterative algorithm as follows. Let $I_i = (a_i, b_i)$, $i = 0, 1, \ldots$, $I_0 = (a, b)$ define 
a sequence of intervals and let $m_i$ be the midpoint of $I_i$. Let 

    $I_{i+1} = \begin{cases} (a_i, m_i) & \text{if } f(a_i) f(m_i) < 0 \\ (m_i, b_i) & \text{if } f(m_i) f(b_i) < 0 \end{cases}$ 

Continue until $f(m_i) = 0$ or the length of $I_i$ divided by the maximum of 1 and $|m_i|$ (see 
below) is less than some given $\epsilon > 0$, in which case approximate $\alpha$ by $m_i$. 
Using a rather formal notation we could describe this algorithm as follows: 


(2.2-1) 

Input 
    a, b, ε, f(x) 

Algorithm 
    I_0 ← (a, b); m_0 ← (a + b)/2 
    for i = 0, 1, 2, ... do 
        if f(m_i) = 0 then α ← m_i; stop 
        if f(a_i) f(m_i) < 0 then I_{i+1} ← (a_i, m_i) 
            else I_{i+1} ← (m_i, b_i) 
        endif 
        m_{i+1} ← (a_{i+1} + b_{i+1})/2 
        if length(I_{i+1})/max(|m_{i+1}|, 1) < ε then α ← m_{i+1}; stop 
    endfor 

Output 
    α 
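
The algorithm (2.2-1) can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function name `bisect`, the default tolerance, and the `max_iter` guard are not part of the text. The guard is one possible way to cope with the non-termination risk discussed in the computational analysis of this example.

```python
def bisect(f, a, b, eps=1e-10, max_iter=200):
    """Bisection in the spirit of (2.2-1); names and defaults are illustrative."""
    fa, fb = f(a), f(b)
    if fa * fb >= 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    m = (a + b) / 2                      # m_0
    for _ in range(max_iter):            # assumed guard against non-termination
        fm = f(m)
        if fm == 0:
            return m
        if fa * fm < 0:
            b = m                        # root lies in (a_i, m_i)
        else:
            a, fa = m, fm                # root lies in (m_i, b_i)
        m = (a + b) / 2                  # m_{i+1}
        # relative test: length(I) / max(|m|, 1) < eps
        if (b - a) / max(abs(m), 1.0) < eps:
            return m
    raise RuntimeError("no convergence within max_iter halvings")


root = bisect(lambda x: x * x - 2.0, 0.0, 2.0, eps=1e-12)
```

Here `root` approximates the positive zero of x^2 - 2 on (0, 2); the test function is chosen only for illustration.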


The strictly mathematical analysis of this algorithm, like that of any algorithm, consists 
of showing that 


1. If it terminates, the result is the desired one; i.e., the algorithm is effective. 
2. The algorithm does terminate. 


First we show that 

    $\lim_{i \to \infty} m_i = \alpha$    (2.2-2) 

where $\alpha$ is some point at which $f(x) = 0$. 
The proof is quite straightforward: 


1. By our hypotheses, at least one $\alpha$ such that $f(\alpha) = 0$ exists. 
2. It also follows from (2.2-1) that each interval $I_i$ contains a point $\alpha$ such that $f(\alpha) = 0$. 
3. Since $m_i$ is the midpoint of $I_i$, it follows from (2.2-1) that the length of $I_{i+1}$ is one-half 
the length of $I_i$. Therefore 

    $\lim_{i \to \infty} \text{length}(I_i) = 0$    (2.2-3) 

Thus, eventually each $I_i$ contains just one point $\alpha$ such that $f(\alpha) = 0$ and this $\alpha$ 
must be the point to which the $m_i$ converge. 


The proof that the algorithm must terminate is even simpler. If $f(m_i) = 0$ at any stage, it 
terminates. If not, then (2.2-3) indicates that the condition $\text{length}(I_i) < \epsilon$ must be satisfied 
at some point. 

The computational analysis, however, is not quite so trivial. 


1. Roundoff may cause a result with a value of $m_i$ not within $\epsilon$ of the true value. For suppose 
at some stage of the iteration roundoff causes the evaluation of $f(m_i)$ to have a sign 
different from its true sign. The effect of this will be that the final value of the root 
cannot be closer to the true root $\alpha$ than $|m_i - \alpha|$ (why?). While it is true that this 
problem is very unlikely to arise except when $m_i$ is already quite close to $\alpha$, nevertheless 
it is quite possible for it to occur well before $|m_i - \alpha| < \epsilon$. It is worth noting that this 
phenomenon is likely to occur only when $f(x)$ is quite "flat" in a neighborhood of $\alpha$, in 
which case an error in $m_i$ larger than $\epsilon$ may be tolerable. 

2. The algorithm may not terminate. For if $\epsilon$ is too small, roundoff in the calculation of the 
length of $I_i$ divided by $m_i$, even though it involves only a single subtraction and 
division, could prevent this length from ever being less than $\epsilon$. Note here the importance 
of using $\text{length}(I_i)/\max(|m_i|, 1)$ rather than just $\text{length}(I_i)$. If $m_i$ is large, two successive 
floating-point numbers in the neighborhood of $m_i$ may not be very close (cf. Sec. 1.5-2). 
Therefore, $\text{length}(I_i) = b_i - a_i$ may never get very small, and if $\epsilon$ is small, a test of 
$\text{length}(I_i) < \epsilon$ might never be satisfied. When $|m_i| > 1$, the effect of dividing by $m_i$ is to 
normalize the length of the interval, thus making it much less likely (but not impossible) 
that the test will never be satisfied. The reason for using $\max(|m_i|, 1)$ as a divisor 
is to protect against a root very near to 0. 


Both these effects are remote but not impossible. A good computational algorithm, i.e., the 
implementation of the algorithm on a computer, should be prepared for either eventuality. 
Our purpose here is not to consider how this might be done but merely to introduce the 
potential difficulties in order to alert the reader to computational problems not apparent 
in the mathematical statement of an algorithm. 


2.3 FUNCTIONALS AND ERROR ANALYSIS 


The analysis of the truncation error in an approximation, particularly in the 
case of polynomial approximation, is significantly facilitated using the concept 
of a functional. Given a family $\mathcal{F}$ of functions defined on an interval $[a, b]$, 
we define a functional $F(f)$ to be a mapping from $\mathcal{F}$ into the set of real (or 
complex) numbers which assigns a unique number to each function $f \in \mathcal{F}$. 
For example, $I(f) = \int_a^b f(x)\,dx$ is a functional, as is the point functional 
$f(x)$, where $x$ is a fixed point in $[a, b]$. So also are the point derivatives $f^{(j)}(x)$ 
when they exist. A functional $F$ is linear if 

    $F(\alpha f + \beta g) = \alpha F(f) + \beta F(g)$    (2.3-1) 

All the functionals mentioned above are linear. However, the functional 
$\|f\|_2 = [\int_a^b f^2(x)\,dx]^{1/2}$ is not. 

In many branches of numerical analysis, e.g., in evaluating $I(f)$ defined 
above, we are concerned with evaluating a particular (usually linear) functional. 
Since this may not always be possible, we must, as noted in Sec. 2.1, 
find an approximation to this functional. One standard method of approximating 
$F(f)$ is to approximate $f$ by some other function $g$, usually a polynomial 
$p_n$, and then use $F(g)$ or $F(p_n)$ as an approximation to $F(f)$. We do this 
by choosing a $g$ for which we can evaluate $F(g)$ exactly. For example, to 
approximate $I(f)$, we evaluate $I(p_n)$, where $p_n(x)$ is a good approximation 
to $f(x)$ in $[a, b]$. In these cases, it is important to determine an expression 
for the error in the approximation. Now since $g$, the approximation of $f$, 
depends upon $f$, $F(g)$ is itself a functional of $f$, say $G(f)$. We define the 
error functional as 

    $E(f) = F(f) - G(f) = F(f) - F(g)$    (2.3-2) 

If $F$ is linear, we have $E(f) = F(f - g)$. 


Example 2.3 Let $F(f) = I(f) = \int_a^b f(x)\,dx$, and let us approximate $f(x)$ by the linear 
polynomial 

    $p_1(x) = f(a) + \frac{x - a}{b - a}[f(b) - f(a)]$ 

Then 

    $F(p_1) = \int_a^b p_1(x)\,dx = \frac{b - a}{2}[f(a) + f(b)] = G(f)$ 

and [see Eq. (4.10-8)] $E(f) = F(f) - G(f) = -[(b - a)^3] f''(\xi)/12$ for some $\xi$ such that 
$a < \xi < b$. 
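
As a rough numerical check of Example 2.3, the sketch below evaluates the error functional for one concrete integrand and compares it with the range allowed by the error term. The choice f(x) = e^x on [0, 1] and the helper name `trapezoid_error` are assumptions made here for illustration only.

```python
import math


def trapezoid_error(f, exact_integral, a, b):
    """E(f) = F(f) - G(f), with G(f) = (b - a)/2 * [f(a) + f(b)]."""
    approx = (b - a) / 2 * (f(a) + f(b))
    return exact_integral - approx


a, b = 0.0, 1.0
err = trapezoid_error(math.exp, math.e - 1.0, a, b)

# For f = exp, f'' = exp takes values in [1, e] on [0, 1], so the formula
# E(f) = -(b - a)^3 f''(xi) / 12 confines the error to this interval:
lower = -(b - a) ** 3 * math.e / 12
upper = -(b - a) ** 3 / 12
```

The computed `err` should fall between `lower` and `upper`, as the error expression requires.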


In general we shall be considering approximations to integrals and 
derivatives and functional values at particular points of a function $f(x)$ by 
linear combinations of integrals and derivatives and functional values at other 
points. Thus we can write 

    $E(f) = \int_a^b [a_0(x) f(x) + a_1(x) f'(x) + \cdots + a_n(x) f^{(n)}(x)]\,dx$ 
        $- \sum_i b_{i0} f(x_{i0}) - \sum_i b_{i1} f'(x_{i1}) - \cdots - \sum_i b_{in} f^{(n)}(x_{in})$    (2.3-3) 

where the $a_i(x)$ are piecewise continuous over $[a, b]$. Usually this error functional 
will vanish when $f$ is a polynomial of a specific degree $n$ or less. 


Theorem 2.3 (Peano) Let $f(x)$ have a continuous $(n + 1)$st derivative in 
$[a, b]$, and let a linear functional $F(f)$ of $f$ be approximated by a linear 
functional $G(f)$ such that $E(f)$ as given by (2.3-3) vanishes when $f$ is any 
polynomial of degree $n$ or less. Then 

    $E(f) = \int_a^b f^{(n+1)}(t) K(t)\,dt$    (2.3-4) 

where 

    $K(t) = \frac{1}{n!} E_x[(x - t)^n_+]$    (2.3-5) 

and 

    $(x - t)^n_+ = \begin{cases} (x - t)^n & x \ge t \\ 0 & x < t \end{cases}$    (2.3-6) 

The notation $E_x$ means the linear functional $E$ is applied to the $x$ variable 
in its argument [$(x - t)^n_+$ in (2.3-5)]. The function $K(t)$ is called the 
Peano kernel for the linear functional $E$. It is also called an influence 
function for $E$. 


PROOF By Taylor's theorem with exact remainder, 

    $f(x) = f(a) + f'(a)(x - a) + \cdots + \frac{f^{(n)}(a)(x - a)^n}{n!} + \frac{1}{n!} \int_a^x f^{(n+1)}(t)(x - t)^n\,dt$    (2.3-7) 

The integral remainder can be written as 

    $\frac{1}{n!} \int_a^b f^{(n+1)}(t)(x - t)^n_+\,dt$ 

We apply $E$ to both sides of Eq. (2.3-7) and use the fact that $E$ 
vanishes for polynomials of degree $n$ or less: 

    $E(f) = \frac{1}{n!} E_x\left[\int_a^b f^{(n+1)}(t)(x - t)^n_+\,dt\right]$    (2.3-8) 

Now since $E$ has the form (2.3-3), we can interchange $E_x$ and the integral; 
hence, 

    $E(f) = \frac{1}{n!} \int_a^b f^{(n+1)}(t) E_x[(x - t)^n_+]\,dt$    (2.3-9) 

which completes the proof. 


Corollary 2.1 

    $|E(f)| \le \max_{a \le x \le b} |f^{(n+1)}(x)| \int_a^b |K(t)|\,dt$    (2.3-10) 

Corollary 2.2 If, in addition, $K(t)$ does not change its sign on $[a, b]$, then 

    $E(f) = \frac{f^{(n+1)}(\xi)}{(n + 1)!} E(x^{n+1})$    $a < \xi < b$    (2.3-11) 


PROOF Under the additional hypothesis we can use the mean-value 
theorem for integrals and obtain 

    $E(f) = f^{(n+1)}(\xi) \int_a^b K(t)\,dt$    (2.3-12) 

Now we insert $f(x) = x^{n+1}$ in (2.3-12) to get 

    $E(x^{n+1}) = (n + 1)! \int_a^b K(t)\,dt$    (2.3-13) 

This yields (2.3-11). 
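
As a worked illustration (not taken from the text), the Peano kernel can be computed by hand for the approximation of Example 2.3, where $E(f) = \int_a^b f(x)\,dx - \frac{b-a}{2}[f(a) + f(b)]$ vanishes for polynomials of degree $n = 1$. Under that assumption, (2.3-5) and (2.3-12) give:

```latex
\begin{align*}
K(t) &= E_x\left[(x - t)_+\right]
      = \int_t^b (x - t)\,dx - \frac{b - a}{2}\left[(a - t)_+ + (b - t)_+\right] \\
     &= \frac{(b - t)^2}{2} - \frac{(b - a)(b - t)}{2}
      = -\frac{(b - t)(t - a)}{2} \le 0, \qquad a \le t \le b, \\
\int_a^b K(t)\,dt &= -\frac{(b - a)^3}{12},
\qquad
E(f) = f''(\xi) \int_a^b K(t)\,dt = -\frac{(b - a)^3}{12}\, f''(\xi).
\end{align*}
```

Since this kernel does not change sign, Corollary 2.2 applies, and the result agrees with the error term quoted in Example 2.3.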


2.4 THE METHOD OF UNDETERMINED COEFFICIENTS 


A common situation in numerical analysis is to be given the form of the 
linear functional by which $f$ is to be approximated and the points at which 
$f(x)$ and/or its derivatives can be evaluated but not the weighting coefficients 
in the linear functional. Thus we are given 

    $F(f) = \sum_{j=0}^{n} \sum_{i=1}^{m_j} w_{ij} f^{(j)}(x_{ij}) + E(f)$    (2.4-1) 

where $f^{(j)}(x)$ is given at the points $x_{ij}$, $i = 1, \ldots, m_j$, $j = 0, \ldots, n$, and we wish 
to determine the $w_{ij}$ so that $E(f)$ has certain properties. The method of 
undetermined coefficients, when it can be applied, is a simple and straightforward 
way of doing this. 

Suppose our aim is that $E(f) = 0$ when $f(x)$ is a polynomial of degree $N$ 
or less, where $N = (\sum_{j=0}^{n} m_j) - 1$. To this end, we replace $f$ in (2.4-1) successively 
by $1, x, \ldots, x^N$. Then the requirement that $E(f) = 0$ for these functions 
yields a system of $N + 1$ linear equations in the $N + 1$ unknowns $w_{ij}$ 

    $\sum_{j=0}^{n} \sum_{i=1}^{m_j} w_{ij} \left[\frac{d^j}{dx^j} x^k\right]_{x = x_{ij}} = F(x^k)$    $k = 0, \ldots, N$    (2.4-2) 

We assume here that $F(x^k)$ can be evaluated exactly. The linear system 
(2.4-2) will have a unique solution yielding the coefficients $w_{ij}$ provided the 
matrix of coefficients is nonsingular. In many situations we are assured that 
this is so; e.g., when only functional values are given ($n = 0$), or when a 
particular derivative $f^{(j)}(x)$ is prescribed at $x_i$ only if $f^{(j-1)}(x)$ is also given, it 
can be shown that the matrix is nonsingular. 

Sometimes we may want to keep a few parameters free to achieve some 
other end than accuracy as measured by the highest-degree polynomial for 
which E(f) vanishes. This can be done by reducing the number of equations 
in (2.4-2). 


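Example 2.4 Given $f(a)$, $f'(a)$, and $f(b)$, compute approximations to 

    $f\left(\frac{a + b}{2}\right)$    and    $\int_a^b f(x)\,dx$ 

Here $N = 2$, $x_{10} = a$, $x_{11} = a$, and $x_{20} = b$. In the first case, we get the following system of 
linear equations 

    $1 w_{10} + 0 w_{11} + 1 w_{20} = 1$ 
    $a w_{10} + 1 w_{11} + b w_{20} = \frac{a + b}{2}$    (2.4-3) 
    $a^2 w_{10} + 2a w_{11} + b^2 w_{20} = \left(\frac{a + b}{2}\right)^2$ 

and the solution is $w_{10} = \frac{3}{4}$, $w_{11} = (b - a)/4$, $w_{20} = \frac{1}{4}$, so that we have the approximation 

    $f\left(\frac{a + b}{2}\right) \approx \frac{3}{4} f(a) + \frac{b - a}{4} f'(a) + \frac{1}{4} f(b)$    (2.4-4) 

In the second case, we need only replace the right-hand side of (2.4-3) by $\int_a^b x^k\,dx$, 
$k = 0, 1, 2$. The solution of the resulting system is $w_{10} = \frac{2}{3}(b - a)$, $w_{11} = \frac{1}{6}(b - a)^2$, 
$w_{20} = \frac{1}{3}(b - a)$, so that we have the approximation 

    $\int_a^b f(x)\,dx \approx \frac{b - a}{6}[4 f(a) + 2 f(b) + (b - a) f'(a)]$    (2.4-5) 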


To get the error in these approximations, we can apply Peano's theorem. In the first 
case, the Peano kernel does not change sign, and so we can use (2.3-13) to derive {24} 

    $E(f) = \frac{(a - b)^3}{48} f'''(\xi)$    $a < \xi < b$ 

Therefore, we can rewrite (2.4-4) as 

    $f\left(\frac{a + b}{2}\right) = \frac{3}{4} f(a) + \frac{b - a}{4} f'(a) + \frac{1}{4} f(b) + \frac{(a - b)^3}{48} f'''(\xi)$    $a < \xi < b$ 

Similarly, in the second case, we can apply Peano's theorem and rewrite (2.4-5) as {24} 

    $\int_a^b f(x)\,dx = \frac{b - a}{6}[4 f(a) + 2 f(b) + (b - a) f'(a)] - \frac{(b - a)^4}{72} f'''(\xi)$ 
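
The linear systems of Example 2.4 can be solved exactly with rational arithmetic. The sketch below is illustrative only: the helper names `solve_exact` and `weights` are assumptions, and the elimination routine is a plain Gauss-Jordan solver written for this 3 x 3 case rather than a general-purpose method.

```python
from fractions import Fraction


def solve_exact(A, rhs):
    """Gauss-Jordan elimination over the rationals (small systems only)."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(A, rhs)]
    for c in range(n):
        p = next(r for r in range(c, n) if M[r][c] != 0)   # find a pivot row
        M[c], M[p] = M[p], M[c]
        M[c] = [v / M[c][c] for v in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0:
                M[r] = [vr - M[r][c] * vc for vr, vc in zip(M[r], M[c])]
    return [M[r][n] for r in range(n)]


def weights(a, b, rhs):
    """Weights (w10, w11, w20) for data f(a), f'(a), f(b); rhs[k] = F(x^k)."""
    a, b = Fraction(a), Fraction(b)
    rows = [[a ** k, (k * a ** (k - 1)) if k > 0 else 0, b ** k] for k in range(3)]
    return solve_exact(rows, rhs)


# Take a = 0, b = 1. Midpoint value: F(x^k) = (1/2)^k.
w_mid = weights(0, 1, [1, Fraction(1, 2), Fraction(1, 4)])
# Integral: F(x^k) = 1/(k + 1).
w_int = weights(0, 1, [1, Fraction(1, 2), Fraction(1, 3)])
```

With $b - a = 1$ these should reproduce the coefficients 3/4, 1/4, 1/4 of (2.4-4) and 2/3, 1/6, 1/3 of (2.4-5).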


BIBLIOGRAPHIC NOTES 


Section 2.1 A number of works treat the general problem of approximation in much 
greater depth than we have done here. In particular, we recommend Achieser (1956), Natanson 
(1964), Cheney (1966), Davis (1963), Rice (1964, 1965), and Rivlin (1969). Some of the more 
specific works on approximation will be referred to in later chapters. 

The cases for and against polynomial approximation and the case for trigonometric 
approximation are well presented in Hamming (1973). Our proof of the Weierstrass theorem is 
due to Bernstein (1912) and can also be found in Achieser (1956) and Rivlin (1969). For a quite 
different proof, see Courant and Hilbert (1953). 


Section 2.2 The best and most comprehensive reference on the development and analysis 
of computer algorithms is the series of books by Knuth (1968, 1969, 1973). 


Section 2.3 For more on Peano's theorem see Sard (1963) and Davis (1963). 


PROBLEMS 


Section 2.1 


1 Find a first-degree polynomial approximation $P(x) = ax + b$ to $\sin x$ on $[0, \pi/2]$: 
(a) Using the first nonzero term of the Maclaurin expansion of $\sin x$. 
(b) Which minimizes $\int_0^{\pi/2} [P(x) - \sin x]^2\,dx$. 
(c) Which minimizes $\int_0^{\pi/2} x[P(x) - \sin x]^2\,dx$. 
For all three approximations, draw a graph of the error $E(x) = \sin x - P(x)$. In what sense is 
the approximation (a) an exact approximation in the terminology of Sec. 2.1? What are the 
significant characteristics of the error in (a) compared with those in the least-squares 
approximations (b) and (c)? 
*2 Now consider the problem of finding an approximation to $\sin x$ of the form $P(x) = 
ax + b$ which minimizes 
    $\max_{x \in [0, \pi/2]} |P(x) - \sin x|$ 
(a) Prove that any such $P(x)$ must be such that $E(x) = \sin x - P(x)$ has at least two zeros 
in $[0, \pi/2]$. 
(b) Starting from the result of Prob. 1c, derive an approximation to $\sin x$ of the form 
$ax + b$ with a smaller maximum error. 
3 For parts (b) and (c) of Prob. 1, derive the equations for $a$ and $b$ when $\sin x$ is replaced 
by $f(x)$. 
4 Repeat Probs. 1 and 3 when $P(x)$ is replaced by the quadratic approximation 
$Q(x) = ax^2 + bx + c$. 
5 Consider the continued-fraction expansion for the inverse tangent 

    $\tan^{-1} t \approx \cfrac{t}{1 + \cfrac{t^2}{3 + \cfrac{4t^2}{5 + \cdots + \cfrac{n^2 t^2}{2n + 1}}}}$ 

(a) Show that for any $n$ this is equivalent to approximating $\tan^{-1} t$ by a rational function. 

(b) For $n = 1$ and 2 compute the rational approximations and draw the graph of the error 
in the approximations on the interval $[0, 1]$. 

(c) For each of the approximations of part (b), consider the analogous approximation 
derived by truncating the Maclaurin-series expansion for $\tan^{-1} t$ after the term of degree equal 
to the sum of the degrees of numerator and denominator in part (b). Draw the graphs of the 
errors and compare with those in part (b). 


6 Derive an approximation to $e^x$ of the form 


which at $x = 0$ has the same value and the same first two derivatives as $e^x$. Draw the graph of 
the error for $-1 \le x \le 1$ (cf. Sec. 7.4). 


*7 Let sin x be approximated by the first three nonzero terms of its Maclaurin expansion. 
(a) For any $x$ find a bound on the magnitude of the truncation error. 
(b) Convert the coefficients to correctly rounded 10-decimal-digit numbers. 

(c) If |x| <1 is correctly rounded to 10 decimal digits, find a bound on the error in the 
correctly rounded value of x?. 

(d) Use this bound to derive an approximate bound on the roundoff error incurred in 
evaluating the approximation in part (b) if all intermediate quantities are rounded to 10 deci- 
mals. Again assume |x| < 1 and neglect all quantities with order of magnitude less than 107¢. 

(e) For $x = \pi/4$ perform the calculation carrying 10 decimals at every stage. Compare the 
result with the correct value. Is the error caused by truncation and roundoff within expected 
limits? 

8 Neglecting roundoff error, how many terms of the Maclaurin expansion for $\sin x$ are 
required to obtain a maximum error of less than $10^{-7}$ in the range $[0, \pi/2]$? In the range [0, 2]? 


9 Let $f(x)$ be approximated by 

    $f(x) \approx P_1(x) f(a_1) + P_2(x) f(a_2)$ 

where $P_1(x)$ and $P_2(x)$ are linear polynomials. 

(a) Find $P_1(x)$ and $P_2(x)$ if the approximation is to have no error at $x = a_1$ and $a_2$ 
independent of the function $f(x)$. What is the advantage of having $P_1(x)$ and $P_2(x)$ independent 
of $f(x)$? 

(b) Find an expression for the error in the approximation when $f(x) = x^2 + ax + b$. How 
do you explain the dependence of the result on $a$ and $b$? 

*10 A common iterative method for calculating the square root of a number $a$ uses the 
formula 

    $x_{n+1} = \frac{1}{2}\left(x_n + \frac{a}{x_n}\right)$    $n = 1, 2, \ldots$ 

where $x_1$ is an arbitrary positive number. 
(a) Prove that if $\lim x_n$ exists, this limit is $\sqrt{a}$. 
(b) Prove that if $a < x_n < 1$, then $a < x_{n+1} < 1$. 
(c) Prove that if $0 < a < 1$, then the iteration does converge, that is $\lim x_n$ does exist. 


11 (a) Derive the change of variable which transforms any finite interval $[a, b]$ to $[0, 1]$. 

(b) By making the appropriate change of variable in (2.1-5), calculate the Bernstein 
polynomials of degree 1, 2, and 3 for $f(x) = \sin x$ on the interval $[0, \pi/2]$. Draw the graph of 
$\sin x - B_n(x)$ in each case and for $n = 1$ and 2 compare these errors with the corresponding 
errors in Probs. 1 and 4. 

(c) Using the inequality (2.1-14), what value of $n$ guarantees a smaller maximum error 
than that found in part (b) for $n = 1$? 


12 (a) Derive the identities (2.1-7). (b) Using these identities, derive (2.1-9). 
*13 Let $F(t)$ be a periodic continuous function of period $2\pi$ and define 

    $\phi(t) = \frac{F(t) + F(-t)}{2}$        $\psi(t) = \frac{F(t) - F(-t)}{2} \sin t$ 

(a) Use Theorem 2.1 to show that, given any $\epsilon > 0$, there exist polynomials $P(x)$ and $Q(x)$ 
such that 

    $|\phi(t) - P(\cos t)| < \frac{\epsilon}{4}$        $|\psi(t) - Q(\cos t)| < \frac{\epsilon}{4}$ 

for all $t$. 

(b) Thus deduce that 

    $|F(t) \sin t - U(t)| < \frac{\epsilon}{2}$    where $U(t) = Q(\cos t) + P(\cos t) \sin t$ 

is a trigonometric sum. Similarly, show that there is another trigonometric sum $V(t)$ such that 

    $\left|F\left(\frac{\pi}{2} - t\right) \sin t - V(t)\right| < \frac{\epsilon}{2}$ 

(c) Use the two inequalities of part (b) to deduce that 

    $\left|F(t) - U(t) \sin t - V\left(\frac{\pi}{2} - t\right) \cos t\right| < \epsilon$ 

and from this complete the proof of Theorem 2.2. [Ref.: Achieser (1956), pp. 32-33.] 


*14 (a) If F(t) in Prob. 13 is also even, show how the proof can be considerably 
simplified. 
(b) For the function 


    $F(t) = \begin{cases} t - 2m\pi & 2m\pi \le t \le (2m + 1)\pi \\ 2m\pi - t & (2m - 1)\pi \le t \le 2m\pi \end{cases}$ 


where m takes on all positive and negative integral values, use this simplification to derive an 
approximation of the form (2.1-15) to F(t) using first- and second-degree Bernstein polyno- 
mials. How do you explain the relation between the two results? Draw the graph of the error. 


15 For f(x) = 1/(1 + x?) calculate the Bernstein polynomials for n = 2, 3, 4 on the inter- 
val [— 1, 1] and draw the graph of the error in each case. (This function is an example of a case 
where exact polynomial approximation leads to certain difficulties; see p. 65 n.) 


Section 2.2 
16 Let $X_1, X_2, \ldots, X_n$ be distinct real numbers. We wish to find $m$ and $j$ such that 

    $m = \max_{1 \le k \le n} X_k = X_j$ 

for which $j$ is as large as possible. 
(a) Prove that the following is an algorithm for this problem: 

    j ← n; k ← n - 1; m ← X_n 
L:  if k = 0 stop 
    if X_k > m then j ← k; m ← X_k 
    k ← k - 1 
    go to L 

(b) Let $A$ be the number of times the "then" part of the "if ... then" statement is executed. 
If $p_{nk}$ represents the probability that $A = k$, show that 

    $p_{nk} = \frac{1}{n} p_{n-1,k-1} + \frac{n - 1}{n} p_{n-1,k}$ 

(c) Use this to deduce that the generating function 

    $G_n(z) = \sum_k p_{nk} z^k$ 

is given by 

    $G_n(z) = \frac{1}{z + n} \binom{z + n}{n}$ 

(d) Show that $\text{mean}(A) = G_n'(1)$ and thus deduce that 

    $\text{mean}(A) = H_n - 1$ 

where $H_n$ is the $n$th harmonic number $\sum_{i=1}^{n} \frac{1}{i}$. 
[Ref.: Knuth (1968), pp. 95-99.] 
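
For small $n$ the claim of part (d) can be checked by brute force before attempting the proof. The sketch below is only an illustration: it transcribes the algorithm of part (a) with a 1-based index $j$, and the names `rightmost_max`, `mean_changes`, and `harmonic` are assumptions, not notation from the text.

```python
from fractions import Fraction
from itertools import permutations


def rightmost_max(xs):
    """Scan X_{n-1}, ..., X_1 as in part (a); return (m, j, A) with 1-based j."""
    n = len(xs)
    j, m, changes = n, xs[n - 1], 0
    for k in range(n - 1, 0, -1):
        if xs[k - 1] > m:                # the "then" part is executed
            j, m = k, xs[k - 1]
            changes += 1
    return m, j, changes


def mean_changes(n):
    """Average of A over all n! orderings of n distinct values."""
    perms = list(permutations(range(1, n + 1)))
    total = sum(rightmost_max(p)[2] for p in perms)
    return Fraction(total, len(perms))


def harmonic(n):
    return sum(Fraction(1, i) for i in range(1, n + 1))
```

For example, `mean_changes(4)` can be compared with `harmonic(4) - 1`; the enumeration assumes all orderings are equally likely.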


*17 Suppose instead of the algorithm of the previous problem we wish to find the two 
largest among distinct $X_1, \ldots, X_n$, $n > 2$. 
(a) Prove that the following is an algorithm for this problem. 

    k ← n - 2; m' ← X_{n-1}; m ← X_n 
L:  if m' > m then t ← m'; m' ← m; m ← t 
    if k = 0 stop 
    if X_k > m' then m' ← X_k 
    k ← k - 1 
    go to L 

(b) As a function of $n$, find the average number of times the exchange in the line labeled L 
is performed. 
(c) Find the average number of times the step m' ← X_k is performed. 
18 The Euclidean algorithm to find the greatest common divisor $d$ of two integers $m$ and $n$ 
is 

    c ← m; d ← n 
L:  q ← c/d; r ← c - qd    (integer division) 
    if r = 0 then stop 
    c ← d; d ← r 
    go to L 


(a) Under what conditions on m and n does this algorithm terminate? 

(b) When it terminates, prove that it is correct. 

(c) Design a modification of this algorithm which will also produce integers a and b such 
that am + bn = d. 

19 (a) Consider the bisection algorithm for a function $f(x)$ which is only piecewise continuous 
in $[a, b]$ but for which $f(a) f(b) < 0$. What can be said about its convergence? 

(b) Apply bisection to $f(x) = x^3 - 2x - 5$ with $a = 0$, $b = 3$. 

20 Let $f(x)$ be continuous and differentiable in a neighborhood of $\alpha$ such that $f(\alpha) = 0$. 
Consider the iteration 

    $x_{i+1} = x_i - f(x_i)$    $i = 0, 1, 2, \ldots$ 

where $x_0$ is an initial approximation to $\alpha$. 

(a) Derive a necessary condition on $f(x)$ in a neighborhood of $\alpha$ for the convergence of the 
sequence $\{x_i\}$ to $\alpha$. 

(b) Apply this method to the function of Prob. 19b with $x_0 = 3$. 

(c) Apply this method to $f(x)/x^3$, where $f(x)$ is the function of Prob. 19b, again with 
$x_0 = 3$. Explain the difference in the results. 


Section 2.3 


21 If F is the family of all functions defined and continuous on [a, b], which of the 
following are linear functionals where f is in F ? 


Df eve de  GF@+/O-F(5S) Git) | LHP ax 


22 (a) What is the Peano kernel for the functional 

    $E(f) = -f(x_0) + 3f(x_1) - 3f(x_2) + f(x_3)$ 

where $x_{i+1} - x_i = h$, $i = 0, 1, 2$, and if $f$ is continuous and infinitely differentiable? 
(b) Show that this kernel does not change sign on $[x_0, x_3]$ and thus apply (2.3-13) to 
compute $E(f)$. 


23 Use Peano's theorem to find the error in Simpson's rule: 

    $\int_a^b f(x)\,dx \approx \frac{b - a}{6}\left[f(a) + 4f\left(\frac{a + b}{2}\right) + f(b)\right]$ 


24 Derive the two error terms in Example 2.4. 


Section 2.4 


25 Given $f(a)$, $f(b)$, $f'(a)$, and $f'(b)$, compute approximations to $f((a + b)/2)$ and 
$\int_a^b f(x)\,dx$ and compare the results with those of Example 2.4. 

26 (a) Given $f(a)$, $f(b)$, $f'(a)$, and $f'(b)$, compute an approximation to $\int_a^b f(x)\,dx$ using 
$N = 2$. 

(b) Determine the free parameter in the solution so that the sum of the squares of the 
coefficients $w_{ij}$ is minimized. Why might you wish to do this? 


CHAPTER 


THREE 
INTERPOLATION 


3.1 INTRODUCTION 


Interpolation lies at the heart of classical numerical analysis. There are two 
main reasons for this. The first is that in hand computation there is continual 
need to look up the value of a function in a table. In order to find the value of 
the function at nontabulated arguments, it is necessary to interpolate. 
Moreover, the highly accurate tables at small increments of the argument 
that we take for granted today are mostly of comparatively recent origin. 
Therefore, classical numerical analysts developed an extremely sophisticated 
group of interpolation methods. Today the need to interpolate arises com- 
paratively seldom; e.g., on digital computers we almost always generate the 
value of a function directly rather than interpolate in a table of values (see 
Chap. 7). And when the need to interpolate in a table does arise, the small 
increments in the arguments in most tables mean that quite simple 
techniques, e.g., linear or quadratic interpolation, will usually suffice. Thus, 
while every numerical analyst must know how to interpolate, he will seldom, 
if ever, have use for the more sophisticated interpolation techniques. 

Why then start the main body of this book with a chapter on interpola- 
tion? The answer to this question is provided by the second of the reasons 
mentioned at the beginning of this section, namely that interpolation for- 
mulas are the starting points in the derivations of many methods in other 
areas of numerical analysis. Almost all the classical methods of numerical 
differentiation, numerical quadrature, and numerical integration of ordinary 
differential equations are directly derivable from interpolation formulas. 
While modern numerical analysis does not rely so heavily on interpolation 
formulas in these areas, their importance and usefulness are still great, as we 
shall see in Chaps. 4 and 5. This, then, is ample motivation for treating 
interpolation at the outset of this book. 

Because we are especially interested in digital-computer applications, 
our approach to interpolation will not emphasize interpolation formulas 
based on difference techniques since they are seldom used on computers. 
Nevertheless, we shall not ignore finite differences because of their great use- 
fulness in hand computation and, even on digital computers, for certain 
applications (see, for example, Sec. 4.13-1). 

Suppose we have a function f(x) which is known (perhaps along with 
certain of its derivatives) at a set of points. These points will hereafter be 
called the tabular points because interpolation so often takes place in a table 
of functional values. The object of interpolation is to estimate values of the 
function at nontabular points and—at least—to bound the error between the 
estimated and true values. Our approach will be to approximate f(x) by a 
function y(x) which, at the tabular points, has the same values as f(x) (and 
perhaps the given derivative values, if any). Thus, in the language of the 
previous chapter, we shall be using exact approximations. In this chapter we 
shall consider only the case where y(x) is a polynomial or a function which is 
a piecewise polynomial. In the last section of Chap. 6 we shall consider the 
case in which y(x) is a linear combination of trigonometric functions. 

We shall usually be concerned with interpolation using only values of 
the function at the tabular points. Thus our interpolation formula has the 
form 


    $f(x) = \sum_{j=1}^{n} l_j(x) f(a_j) + E(x) = y(x) + E(x)$    (3.1-1) 

although the more general formula 

    $f(x) = \sum_{j} \sum_{i} A_{ij}(x) f^{(i)}(a_j) + E(x)$    (3.1-2) 

is also of interest, particularly some special cases of it. Our object is to 
determine the $l_j(x)$ so that 

    $E(a_j) = 0$    $j = 1, \ldots, n$    (3.1-3) 

independent of the function $f(x)$. In general, however, for nontabular points 

    $E(x) \ne 0$    (3.1-4) 

Our two aims, then, are to determine the $l_j(x)$ so that (3.1-3) is satisfied and 
to find a representation for $E(x)$ which will enable us to estimate or at least 
bound the error for values of $x \ne a_j$, $j = 1, \ldots, n$. 


3.2 LAGRANGIAN INTERPOLATION 


In this section we consider the case where there are no restrictions on the 
spacing of the tabular points. In Sec. 3.3 we shall consider the case of equally 
spaced abscissas. Even in the general situation we consider here, however, 
the determination of the polynomials $l_j(x)$ is straightforward. Since we wish 
the error at the tabular points to be zero independent of $f(x)$, it follows using 
(3.1-1) and (3.1-3) that 

    $l_j(a_k) = \delta_{jk}$    $j, k = 1, \ldots, n$    (3.2-1) 

where $\delta_{jk}$ is the Kronecker delta.† Since $l_j(x)$ is to be a polynomial, this 
requires that it have a factor 

    $(x - a_1)(x - a_2) \cdots (x - a_{j-1})(x - a_{j+1}) \cdots (x - a_n)$    (3.2-2) 

and since $l_j(a_j) = 1$, we can write 

    $l_j(x) = \frac{(x - a_1) \cdots (x - a_{j-1})(x - a_{j+1}) \cdots (x - a_n)}{(a_j - a_1) \cdots (a_j - a_{j-1})(a_j - a_{j+1}) \cdots (a_j - a_n)}$    (3.2-3) 

Note that there are other possible polynomial representations of $l_j(x)$ but 
(3.2-3) is the only possible polynomial of degree $n - 1$ and no polynomial of 
lesser degree is possible (why?). It is notationally convenient to write $l_j(x)$ as 

    $l_j(x) = \frac{p_n(x)}{(x - a_j) p_n'(a_j)}$    $j = 1, \ldots, n$    (3.2-4) 

where 

    $p_n(x) = \prod_{i=1}^{n} (x - a_i)$    (3.2-5) 


To find an expression for $E(x)$, we consider the function 

    $F(z) = f(z) - y(z) - [f(x) - y(x)] \frac{p_n(z)}{p_n(x)}$    (3.2-6) 

with $y(x)$ as in (3.1-1). The function $F(z)$ as a function of $z$ has $n + 1$ zeros at 
the points $a_1, \ldots, a_n$ and $x$ [assume for now that $x$ in (3.2-6) is not one of the 
tabular points]. Therefore, by applying Rolle's theorem $n$ times‡ 

    $F^{(n)}(z) = f^{(n)}(z) - y^{(n)}(z) - [f(x) - y(x)] \frac{n!}{p_n(x)}$    (3.2-7) 

has at least one zero in the interval spanned by $a_1, \ldots, a_n$ and $x$. Calling this 
zero $z = \xi$ and noting that $y^{(n)}(z) = 0$ since $l_j(z)$ is a polynomial of degree 
$n - 1$, we have 

    $0 = F^{(n)}(\xi) = f^{(n)}(\xi) - [f(x) - y(x)] \frac{n!}{p_n(x)}$    (3.2-8) 

from which, using (3.1-1), it follows that 

    $E(x) = \frac{p_n(x)}{n!} f^{(n)}(\xi)$    (3.2-9) 

where $\xi$, which is an unknown function of $x$, lies in the interval spanned by 
$a_1, \ldots, a_n$ and $x$. Although $x$ in (3.2-6) was restricted to be a nontabular 
point, $E(x)$, as given by (3.2-9), holds for both tabular and nontabular points 
(why?). 

† $\delta_{jk} = 0$ unless $j = k$, in which case $\delta_{jk} = 1$. 
‡ Here and throughout the book, we shall assume that the functions involved are 
differentiable as many times as necessary for the discussion. 

Equation (3.1-1) with the $l_j(x)$ given by (3.2-4) and $E(x)$ by (3.2-9) is 
called the Lagrangian interpolation formula. When $n = 2$, $y(x)$ is the familiar 
formula for linear interpolation 

    $y(x) = \frac{x - a_2}{a_1 - a_2} f(a_1) + \frac{x - a_1}{a_2 - a_1} f(a_2)$    (3.2-10) 

The polynomials $l_j(x)$ are called Lagrangian interpolation polynomials. Our 
derivation of the Lagrangian formula has been equivalent to finding that 
polynomial of degree $n - 1$ which passes through the points $[a_j, f(a_j)]$, $j = 1, 
\ldots, n$ {3}. Therefore, as we would expect, (3.2-9) indicates that this formula is 
exact, that is, $E(x) = 0$ for all $x$, for polynomials of degree $n - 1$ or less. In 
general, an interpolation formula which is exact for polynomials of degree $r$ 
is said to have an order of accuracy $r$ or to be of order $r$. 

The use of the Lagrangian interpolation formula is straightforward. To 
estimate f(x) at a nontabular point, we merely compute y(x) as given by 
(3.1-1) using (3.2-4) and (3.2-5) to compute the polynomials /,(x). If we can 
estimate or bound the nth derivative of f(x), then the error can be estimated 
or bounded using (3.2-9). 


Example 3.1 Let $f(x) = \ln x$. Given the table of values 

      x   |   .40        .50        .70        .80 
    ln x  | -.916291   -.693147   -.356675   -.223144 

estimate the value of $\ln .60$. 
With $a_1 = .40$, $a_2 = .50$, $a_3 = .70$, and $a_4 = .80$, we calculate from (3.2-4) 

    $l_1(.60) = -\frac{1}{6}$    $l_2(.60) = \frac{2}{3}$    $l_3(.60) = \frac{2}{3}$    $l_4(.60) = -\frac{1}{6}$ 

and from (3.1-1) we get the approximation 

    $\ln .60 \approx -.509975$ 

The true value is $\ln .60 = -.510826$. From (3.2-9) we get 

    $E(.60) = \frac{p_4(.60)}{4!}\left(-\frac{6}{\xi^4}\right) = -\frac{.0004}{4}\,\frac{1}{\xi^4}$ 

In the interval (.4, .8), $10^4/4096 < 1/\xi^4 < 10^4/256$, so that 

    $-\frac{1}{4096} > E(.60) > -\frac{1}{256}$ 

and indeed the difference between the approximate and true values lies within this error. 
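
The calculation of Example 3.1 can be reproduced with a direct transcription of (3.1-1) and (3.2-3). The function name `lagrange_estimate` below is an assumption made for illustration; the tabular values are those of the example.

```python
import math


def lagrange_estimate(xs, ys, x):
    """Evaluate y(x) = sum_j l_j(x) f(a_j), with l_j(x) formed as in (3.2-3)."""
    total = 0.0
    for j, (aj, fj) in enumerate(zip(xs, ys)):
        lj = 1.0
        for i, ai in enumerate(xs):
            if i != j:
                lj *= (x - ai) / (aj - ai)
        total += lj * fj
    return total


xs = [0.40, 0.50, 0.70, 0.80]
ys = [-0.916291, -0.693147, -0.356675, -0.223144]
estimate = lagrange_estimate(xs, ys, 0.60)

# Error E(.60) = f(.60) - y(.60), to be compared with the bounds of the example.
error = math.log(0.60) - estimate
```

Forming each $l_j(x)$ as a product costs on the order of $n^2$ operations per evaluation, which is negligible for tables of this size.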


An alternate approach to that in this section which must lead to the 
same polynomial (why?) and which is more convenient in certain applica- 
tions is given by the Newton interpolation formula using divided differences 
{22}. 


3.3 INTERPOLATION AT EQUAL INTERVALS 


In many applications of interpolation, the tabular points are equally spaced. 
For this reason it is worthwhile to consider the simplifications of the 
Lagrangian formula that can be made in this case. 


3.3-1 Lagrangian Interpolation at Equal Intervals 
Let the equal spacing be $h$, so that 

    $a_{j+1} - a_j = h$    $j = 1, \ldots, n - 1$    (3.3-1) 

For reasons of symmetry and computational convenience, it is common to 
take $n$ odd and let 

    $x = a_r + hm$    (3.3-2) 

where $r = (n + 1)/2$. Thus $m = 0$ corresponds to the center of the interval 
spanned by the tabular points. Using (3.3-2), $p_n(x)$ and $l_j(x)$ can be expressed 
as functions of $m$. In particular, from (3.2-3) it follows that $l_j(m)$ is independent 
of $h$ and can thus be tabulated as a function of $m$. When we use (3.3-2) 
and write $f(a_r + hm)$ as $f(m)$, the Lagrangian interpolation formula becomes 

    $f(m) = \sum_{j=1}^{n} l_j(m) f(a_j) + \frac{h^n p_n(m)}{n!} f^{(n)}(\xi)$    (3.3-3) 

where 

    $p_n(m) = (m - r + 1)(m - r + 2) \cdots m(m + 1) \cdots (m + r - 1)$    (3.3-4) 


Table 3.1 is a short tabulation of the Lagrangian interpolation polynomials 
$l_j(m)$ for $n = 5$. Clearly, when $m$ and $n$ are such that the $l_j(m)$ are 
tabulated, the use of (3.3-3) is quite straightforward on a hand calculator. On 
a digital computer, it will seldom be convenient to store such a table; rather 
it will be easier to generate the values of $l_j(m)$ using (3.2-4). 


Table 3.1 Values of the Lagrangian interpolation polynomials 
for n = 5 (x = a_3 + hm) 

   m      l_1(m)   l_2(m)   l_3(m)   l_4(m)   l_5(m) 

   0       .0000    .0000   1.0000    .0000    .0000      0 
   .2      .0144   -.1056    .9504    .1584   -.0176    -.2 
   .4      .0224   -.1536    .8064    .3584   -.0336    -.4 
   .6      .0224   -.1456    .5824    .5824   -.0416    -.6 
   .8      .0144   -.0896    .3024    .8064   -.0336    -.8 
  1.0      .0000    .0000    .0000   1.0000    .0000   -1.0 
  1.2     -.0176    .1024   -.2816   1.1264    .0704   -1.2 
  1.4     -.0336    .1904   -.4896   1.1424    .1904   -1.4 
  1.6     -.0416    .2304   -.5616    .9984    .3744   -1.6 
  1.8     -.0336    .1824   -.4256    .6384    .6384   -1.8 
  2.0      .0000    .0000    .0000    .0000   1.0000   -2.0 

          l_5(m)   l_4(m)   l_3(m)   l_2(m)   l_1(m)      m 

Example 3.2 Using the same data as in Example 3.1 plus the true value of ln .60, estimate
the value of ln .54.
We have h = .1; using Table 3.1 with m = -.6, we get from (3.3-3)


ln .54 ≈ -.0416 ln .40 + .5824 ln .50 + .5824 ln .60 - .1456 ln .70
         + .0224 ln .80 = -.616143


whereas the true value is -.616186.
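
On a digital computer the coefficients need not be read from Table 3.1; they can be generated from the product form of the Lagrangian polynomials. The following is a minimal sketch under that assumption (the name `lagrange_coeffs` is ours, not the text's). It repeats the calculation of Example 3.2, taking the logarithms from `math.log` rather than from a six-place table, so the last printed digit may differ from the hand computation.

```python
import math

def lagrange_coeffs(m, n=5):
    """Return [l_1(m), ..., l_n(m)] for n equally spaced points (n odd),
    with m measured in units of h from the central point a_r, r = (n + 1) / 2."""
    r = (n + 1) // 2
    nodes = [j - r for j in range(1, n + 1)]   # positions of a_1..a_n in units of h
    coeffs = []
    for j, tj in enumerate(nodes):
        c = 1.0
        for k, tk in enumerate(nodes):
            if k != j:
                c *= (m - tk) / (tj - tk)
        coeffs.append(c)
    return coeffs

xs = [0.40, 0.50, 0.60, 0.70, 0.80]   # tabular points of Example 3.2
h = 0.1
m = (0.54 - xs[2]) / h                # m = -.6, measured from a_3 = .60
coeffs = lagrange_coeffs(m)
estimate = sum(c * math.log(x) for c, x in zip(coeffs, xs))
print([round(c, 4) for c in coeffs], round(estimate, 6))
```

The coefficients printed are those read from the m = -.6 entries of Table 3.1, which gives a quick check on both the table and the routine.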


When the values of $l_j(m)$ are not tabulated, for hand computation, instead
of (3.3-3) it is preferable to use the finite-difference interpolation formulas,
which we shall discuss in Sec. 3.4. Before proceeding to discuss finite
differences, however, we emphasize that there is one and only one polynomial
of degree n - 1 that takes on the values of f(x) at the n tabular points (why?).
In what follows, we shall write interpolation formulas in a form very different
from (3.1-1) or (3.3-3). But as long as these formulas involve polynomials
passing through the same n tabular points, they will be identical to the
Lagrangian interpolation formula.


3.3-2 Finite Differences 


In textbooks on classical numerical analysis, the calculus of finite differences 
and the interpolation, differentiation, and integration formulas based on it 
were always of central importance. This is because, for work on desk calcula- 
tors, finite differences are a wonderfully convenient tool. Aside from their 
advantages for hand computation, there are certain special applications for 
which finite differences are invaluable (see, for example, Sec. 4.13-2). Also
they are used extensively (although generally in a quite simple form) in the
numerical solution of partial differential equations and boundary-value 
problems of ordinary differential equations on digital computers. 


Definitions As in Sec. 3.3-1, let the interval between successive tabular 
points be h. Then we define: 


1. The kth forward difference of f(x) as

   $\Delta^k f(x) = \Delta^{k-1} f(x + h) - \Delta^{k-1} f(x) \qquad k = 1, 2, \ldots$
   $\Delta^0 f(x) = f(x)$   (3.3-5)

Thus, for example,

   $\Delta^1 f(x) = \Delta f(x) = f(x + h) - f(x)$   (3.3-6)

   $\Delta^2 f(x) = \Delta f(x + h) - \Delta f(x) = f(x + 2h) - 2f(x + h) + f(x)$   (3.3-7)


In fact, it should be clear from this definition that any order difference can 
be written as a linear combination of functional values as in (3.3-6) and 
(3.3-7). The general form of this linear combination, whose derivation we 
leave to a problem {7}, is 


   $\Delta^k f(x) = \sum_{j=0}^{k} (-1)^{k-j} \binom{k}{j} f(x + jh)$   (3.3-8)

where the binomial coefficient $\binom{k}{j} = \frac{k!}{j!\,(k - j)!}$.


2. The kth backward difference as

   $\nabla^k f(x) = \nabla^{k-1} f(x) - \nabla^{k-1} f(x - h) \qquad k = 1, 2, \ldots$
   $\nabla^0 f(x) = f(x)$   (3.3-9)

3. The kth central difference as

   $\delta^k f(x) = \delta^{k-1} f(x + \tfrac{1}{2}h) - \delta^{k-1} f(x - \tfrac{1}{2}h) \qquad k = 1, 2, \ldots$
   $\delta^0 f(x) = f(x)$   (3.3-10)

Note that if x is a tabular point, only even central differences involve
tabular points (why?).
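
The three recursive definitions translate directly into code. The sketch below is illustrative only (the function names and the cubic test function are our own choices); it also evaluates the forward difference as a binomial-weighted sum of functional values, so that the recursion and the closed form can be compared on the same data.

```python
from math import comb

def forward_diff(f, x, h, k):
    """k-th forward difference from the recursion (3.3-5)."""
    if k == 0:
        return f(x)
    return forward_diff(f, x + h, h, k - 1) - forward_diff(f, x, h, k - 1)

def backward_diff(f, x, h, k):
    """k-th backward difference from the recursion (3.3-9)."""
    if k == 0:
        return f(x)
    return backward_diff(f, x, h, k - 1) - backward_diff(f, x - h, h, k - 1)

def central_diff(f, x, h, k):
    """k-th central difference from the recursion (3.3-10)."""
    if k == 0:
        return f(x)
    return central_diff(f, x + h / 2, h, k - 1) - central_diff(f, x - h / 2, h, k - 1)

def forward_diff_sum(f, x, h, k):
    """k-th forward difference as a linear combination of f(x + j*h), j = 0..k."""
    return sum((-1) ** (k - j) * comb(k, j) * f(x + j * h) for j in range(k + 1))

def cube(x):
    # hypothetical test function; any polynomial of degree 3 would do
    return x ** 3

x, h = 1.0, 0.5
d3 = forward_diff(cube, x, h, 3)                 # constant 3! * h**3 for a cubic
d4 = forward_diff(cube, x, h, 4)                 # fourth difference of a cubic
same = forward_diff_sum(cube, x, h, 3)           # closed form, same value as d3
shifted = backward_diff(cube, x + 3 * h, h, 3)   # uses the same four points as d3
centered = central_diff(cube, x + h, h, 2)       # f(x + 2h) - 2 f(x + h) + f(x)
print(d3, d4, same, shifted, centered)
```

The recursive forms are convenient for checking identities; for tables of any size the column-by-column construction of a difference table is far cheaper, since the recursion recomputes the same functional values many times.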


A property of differences that we shall have use for later is that the first
difference of a polynomial of degree n is a polynomial of degree n - 1 {8}.
Therefore, the nth difference of a polynomial of degree n is a constant, and the
(n + 1)st difference is identically zero. The properties of finite differences and
the formulas based upon them can be derived by operational calculus using
the difference operators $\Delta$, $\nabla$, and $\delta$; we leave a consideration of this
approach to a problem {9}.

The lozenge diagram In the remainder of this section, we shall denote
$\Delta^k f(a_j)$ by $\Delta^k f_j$, with a corresponding notation for backward and central
differences. Furthermore, we shall change our previous notation slightly and
let the tabular points have both positive and negative subscripts. When we
calculate differences, it is convenient to set up a difference table, as in Fig. 3.1,
in which each entry after the second column is the difference of the two
immediately to its left. The use of forward differences in the table is arbitrary;
backward differences could just as easily have been used (but not
central differences; why?).


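a_{-4}  f_{-4}                 \Delta^2 f_{-5}                   \Delta^4 f_{-6}
                \Delta f_{-4}                   \Delta^3 f_{-5}                   \Delta^5 f_{-6}
a_{-3}  f_{-3}                 \Delta^2 f_{-4}                   \Delta^4 f_{-5}
                \Delta f_{-3}                   \Delta^3 f_{-4}                   \Delta^5 f_{-5}
a_{-2}  f_{-2}                 \Delta^2 f_{-3}                   \Delta^4 f_{-4}
                \Delta f_{-2}                   \Delta^3 f_{-3}                   \Delta^5 f_{-4}
a_{-1}  f_{-1}                 \Delta^2 f_{-2}                   \Delta^4 f_{-3}
                \Delta f_{-1}                   \Delta^3 f_{-2}                   \Delta^5 f_{-3}
a_0     f_0                    \Delta^2 f_{-1}                   \Delta^4 f_{-2}
                \Delta f_0                      \Delta^3 f_{-1}                   \Delta^5 f_{-2}
a_1     f_1                    \Delta^2 f_0                      \Delta^4 f_{-1}
                \Delta f_1                      \Delta^3 f_0                      \Delta^5 f_{-1}
a_2     f_2                    \Delta^2 f_1                      \Delta^4 f_0
                \Delta f_2                      \Delta^3 f_1                      \Delta^5 f_0
a_3     f_3                    \Delta^2 f_2                      \Delta^4 f_1
                \Delta f_3                      \Delta^3 f_2                      \Delta^5 f_1
a_4     f_4                    \Delta^2 f_3                      \Delta^4 f_2


Figure 3.1 Forward difference table.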


Example 3.3 Using the data of Example 3.2 with one point added at either end, compute
the difference table.
The result is


  x       ln x      \Delta    \Delta^2   \Delta^3   \Delta^4   \Delta^5   \Delta^6

 .30   -1.203973
                   .287682
 .40    -.916291             -.064538
                   .223144               .023715
 .50    -.693147             -.040823              -.011062
                   .182321               .012653               .005959
 .60    -.510826             -.028170              -.005103              -.003534
                   .154151               .007550               .002425
 .70    -.356675             -.020620              -.002678
                   .133531               .004872
 .80    -.223144             -.015748
                   .117783
 .90    -.105361
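
The construction of such a table is easily mechanized, since each column is obtained by differencing the one before it. The following sketch (the name `difference_table` is our own) rebuilds the table of Example 3.3 from six-place values of ln x; the rounding to six places is done here only so that the output can be compared with the hand computation.

```python
import math

def difference_table(values):
    """Return the columns [f, first differences, second differences, ...].
    Column k holds the k-th forward differences, top to bottom."""
    columns = [list(values)]
    while len(columns[-1]) > 1:
        prev = columns[-1]
        columns.append([b - a for a, b in zip(prev, prev[1:])])
    return columns

xs = [i / 10 for i in range(3, 10)]            # .30, .40, ..., .90
fs = [round(math.log(x), 6) for x in xs]       # six-place table values
columns = difference_table(fs)
for k, col in enumerate(columns):
    print(k, [f"{v:.6f}" for v in col])
```

Column k of the result holds the entries that appear in the kth difference column of the printed table, read from top to bottom.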
Figure 3.2 The lozenge diagram.


If to Fig. 3.1 we add connecting lines and binomial coefficients† as in
Fig. 3.2, we can use this modified difference table, called a lozenge or Fraser
diagram, to generate most of the interesting finite-difference interpolation
formulas. To generate such an interpolation formula, we proceed as follows:


I. Start at an entry in the first (functional value) column and proceed
along any path in the lozenge diagram; i.e., if a segment terminates on a
difference, the path may be continued along any of the other three paths
leading from the difference. End the path at any difference.

II. Then construct the formula by:

A. Writing down the functional value at which the path started.

B.1. For every left-to-right segment in the path add a term consisting of
the difference on which the segment terminates multiplied by the
binomial coefficient directly below this difference if the slope of the
segment is positive and directly above if the slope of the segment is
negative, and

2. For every right-to-left segment subtract a term consisting of the
difference at which the segment originates multiplied by the binomial
coefficient directly below this difference if the slope of the segment
is positive, i.e., if the segment goes downward and to the left,
and directly above if the slope is negative.


† $(m + k)_n = \frac{(m + k)(m + k - 1) \cdots (m + k - n + 1)}{n!} = \binom{m + k}{n}$.
In this section we let m be such that $x = a_0 + hm$ [cf. (3.3-2)].


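These rules imply that if at a given difference we change direction from
right-to-left to left-to-right, this difference does not appear in the interpolation
formula. As an example of the opposite situation, the path


[Diagram: a path that enters $\Delta f_0$ along a left-to-right segment and leaves it along a right-to-left segment; $(m)_1$ is the coefficient directly above $\Delta f_0$ and $(m - 1)_1$ the coefficient directly below.]


gives rise to the terms


$(m - 1)_1 \Delta f_0 - (m)_1 \Delta f_0$


For example, starting at $f_0$, proceeding along lines sloping downward to
the right and terminating with the nth difference, we get, writing $y(a_0 + hm)$
as $y(m)$,


$y(m) = f_0 + (m)_1 \Delta f_0 + (m)_2 \Delta^2 f_0 + \cdots + (m)_n \Delta^n f_0 = \sum_{j=0}^{n} (m)_j \Delta^j f_0$   (3.3-11)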


This formula, called Newton’s forward formula, will be discussed in more 
detail below. 

The value of the procedure outlined above is contained in the statement 
that any formula derived by this procedure which terminates with an nth 
difference is algebraically equivalent to an equal-interval Lagrangian formula 
which uses the tabular points involved in the terminating difference. [For
example, the nth difference in (3.3-11) involves the points $a_0, \ldots, a_n$; see
(3.3-8).] The proof of this assertion requires that we show that


1. At least one formula has this property. Below we shall prove that 
Newton’s forward formula has the desired property. 

2. All formulas which terminate with the same difference no matter by what 
path they reach that difference are algebraically equivalent. We leave the 
proof of this to a problem {10}. 


Finite-difference interpolation formulas We prove first that Eq. (3.3-11),
Newton’s forward formula, is algebraically equivalent to the Lagrangian
interpolation formula at equal intervals for the n + 1 points $a_0, \ldots, a_n$. Since
$(m)_n$ is a polynomial of degree n in m, it is sufficient to prove that y(i) in
(3.3-11) equals $f_i$, i = 0, ..., n, for then y(m) would be the unique polynomial
of degree n passing through the n + 1 points $f_i$. Using (3.3-8) in (3.3-11), we
get


$y(i) = \sum_{j=0}^{n} (i)_j \Delta^j f_0 = \sum_{j=0}^{n} \sum_{r=0}^{j} (-1)^{j-r} (i)_j \binom{j}{r} f_r$
$\phantom{y(i)} = \sum_{r=0}^{n} \Bigl[ \sum_{j=r}^{n} (-1)^{j-r} (i)_j \binom{j}{r} \Bigr] f_r \qquad i = 0, \ldots, n$   (3.3-12)


The coefficient of $f_r$ in y(i) is then given by

$\sum_{j=r}^{n} (-1)^{j-r} (i)_j \binom{j}{r}$   (3.3-13)


For r > i this coefficient is zero since $(i)_j = 0$ if i < j. When r = i, the only
nonzero term in (3.3-13) is that for j = r and equals 1. When r < i, (3.3-13)
can be written


$\sum_{j=r}^{i} (-1)^{j-r} \binom{i}{j} \binom{j}{r}$   (3.3-14)


which, by suitable manipulation {12}, can be shown to vanish. Thus, the
right-hand side of (3.3-12) is just $f_i$, which completes the proof.
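
The property just proved, namely that Newton's forward formula reproduces the tabular values at m = 0, 1, ..., n, can also be checked numerically. The sketch below is an illustration only: the helper names `gen_binom` and `newton_forward` are ours, and the data (ln x at seven equally spaced points) are chosen merely as a convenient test case.

```python
import math

def gen_binom(t, k):
    """Generalized binomial coefficient (t)_k = t (t - 1) ... (t - k + 1) / k!."""
    result = 1.0
    for i in range(k):
        result *= (t - i) / (i + 1)
    return result

def newton_forward(fs, m):
    """Evaluate Newton's forward formula at x = a_0 + h*m for fs = [f_0, ..., f_n]."""
    total, column = 0.0, list(fs)
    for j in range(len(fs)):
        total += gen_binom(m, j) * column[0]      # column[0] is the j-th difference at f_0
        column = [b - a for a, b in zip(column, column[1:])]
    return total

fs = [math.log(0.3 + 0.1 * i) for i in range(7)]  # hypothetical test data
residuals = [newton_forward(fs, i) - fs[i] for i in range(len(fs))]
midpoint = newton_forward(fs, 3.5)                # estimate of ln .65 from all seven points
print(max(abs(r) for r in residuals), midpoint)
```

The residuals are at the level of floating-point roundoff, as the proof requires, while the value at m = 3.5 is an interpolated estimate and carries the usual truncation error.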

Using the lozenge diagram, we can generate the following interpolation 
formulas. 


Newton’s backward formula Starting at $f_0$ and proceeding along lines sloping
upward and to the right, we get


$y(m) = f_0 + (m)_1 \Delta f_{-1} + (m + 1)_2 \Delta^2 f_{-2} + \cdots + (m + n - 1)_n \Delta^n f_{-n}$   (3.3-15)


which is equivalent to a Lagrangian formula using the points $a_0, a_{-1}, \ldots, a_{-n}$.
In fact, this formula is more conveniently expressed in terms of backward
differences {13}.


Gauss’ forward formula Here we proceed in a zigzag, downward and to the
right, then upward and to the right, then downward and to the right, etc. The
result is

$y(m) = f_0 + (m)_1 \Delta f_0 + (m)_2 \Delta^2 f_{-1} + (m + 1)_3 \Delta^3 f_{-1} + (m + 1)_4 \Delta^4 f_{-2} + \cdots$   (3.3-16)


Gauss’ backward formula Here we proceed as in Gauss’ forward formula
except that the first step is upward and to the right. The formula is

$y(m) = f_0 + (m)_1 \Delta f_{-1} + (m + 1)_2 \Delta^2 f_{-1} + (m + 1)_3 \Delta^3 f_{-2} + (m + 2)_4 \Delta^4 f_{-2} + \cdots$   (3.3-17)


Both Gaussian formulas are conveniently expressed in terms of central 
differences {13}. 

Because of the result stated in Sec. 3.3-2, each of these formulas is alge- 
braically equivalent to the Lagrangian formula which uses the same tabular 
points. The errors in these formulas therefore are given by (3.2-9). In the next 
section we shall indicate why it is useful to be able to express the same 
interpolation formula in a number of different forms. 

If we take the mean of Gauss’ forward and backward formulas as given
by (3.3-16) and (3.3-17), we get Stirling’s interpolation formula


$y(m) = f_0 + \frac{m}{2} (\Delta f_0 + \Delta f_{-1}) + \frac{1}{2} \bigl[ (m)_2 \Delta^2 f_{-1} + (m + 1)_2 \Delta^2 f_{-1} \bigr] + \cdots$   (3.3-18)


Stirling’s formula can be conveniently expressed in terms of central differ- 
ences {13}. 

Bessel’s interpolation formula is the mean of the Gaussian forward formula
given by (3.3-16) and a Gaussian backward formula launched not from
$f_0$ but from $f_1$. It has the form


$y(m) = \frac{1}{2}(f_0 + f_1) + (m - \frac{1}{2}) \Delta f_0 + \frac{1}{2} (m)_2 (\Delta^2 f_0 + \Delta^2 f_{-1}) + \cdots$   (3.3-19)


Note that when (3.3-17) is modified to consider launching from $f_1$, m must be
replaced by m - 1 so that the origin of m is still at $a_0$. Bessel’s formula is also
conveniently written using central differences {13}.

Some other interpolation formulas which can be obtained by manipulat- 
ing the ones derived in this section are considered in a problem {14}. 
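
The relation between the two Gaussian formulas and their mean can be examined numerically. The following sketch is an illustration under our own assumptions: the helper names are hypothetical, the data are ln x at x = .30, ..., .90 with origin $a_0$ = .60, and the coefficient and subscript patterns are read off from the first few terms of the two Gaussian formulas. Truncated at an even difference, the two formulas use the same tabular points and agree; truncated at an odd difference, they use different points and do not.

```python
import math

def gen_binom(t, k):
    """Generalized binomial coefficient (t)_k = t (t - 1) ... (t - k + 1) / k!."""
    result = 1.0
    for i in range(k):
        result *= (t - i) / (i + 1)
    return result

def difference_table(values):
    """Column k holds the k-th forward differences, top to bottom."""
    cols = [list(values)]
    while len(cols[-1]) > 1:
        cols.append([b - a for a, b in zip(cols[-1], cols[-1][1:])])
    return cols

def gauss_forward(cols, origin, m, k_max):
    """Gauss' forward formula truncated after the k_max-th difference."""
    total = cols[0][origin]
    for k in range(1, k_max + 1):
        total += gen_binom(m + (k - 1) // 2, k) * cols[k][origin - k // 2]
    return total

def gauss_backward(cols, origin, m, k_max):
    """Gauss' backward formula truncated after the k_max-th difference."""
    total = cols[0][origin]
    for k in range(1, k_max + 1):
        total += gen_binom(m + k // 2, k) * cols[k][origin - (k + 1) // 2]
    return total

def stirling(cols, origin, m, k_max):
    """Stirling's formula formed as the mean of the two Gaussian formulas."""
    return 0.5 * (gauss_forward(cols, origin, m, k_max)
                  + gauss_backward(cols, origin, m, k_max))

xs = [i / 10 for i in range(3, 10)]
cols = difference_table([math.log(x) for x in xs])
origin, m = 3, 0.5                      # a_0 = .60, x = .65
even = (gauss_forward(cols, origin, m, 4), gauss_backward(cols, origin, m, 4))
odd = (gauss_forward(cols, origin, m, 3), gauss_backward(cols, origin, m, 3))
print(even, odd, stirling(cols, origin, m, 4))
```

Forming Stirling's formula as an explicit mean is wasteful in practice, but it follows the derivation literally and makes the dependence on the parity of the last difference easy to see.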


3.4 THE USE OF INTERPOLATION FORMULAS 


With the exception of Stirling’s formula terminated with an odd difference 
and Bessel’s formula terminated with an even difference, all the interpolation 
formulas we have derived are algebraically equivalent over the same set of 
tabular points. For equally spaced data, the ease with which difference tables 
can be generated makes the finite-difference interpolation formulas more 
convenient than the Lagrangian formula for hand computation. To get some 
insight into which of the finite-difference formulas to use in a given application
(why does it matter if they are all equivalent?), let us consider interpolation
in a table of values.

One of the great advantages of the finite-difference interpolation for- 
mulas is the ease with which added terms of the formula can be used merely 
by calculating higher differences in the table of Fig. 3.1. For example, if we
add the value of ln x at 1.0 to the table of Example 3.3, we can calculate a
new row of differences and thereby get a difference of order 7 in the table.
Commonly, we do not know a priori how many terms in a given interpola- 
tion formula will be sufficient to achieve the accuracy we desire. Therefore, 
we generally add terms to the formula by computing higher differences until 
the contribution of the added terms is so small that the number of decimal 
places of interest to us has stabilized. [If by use of (3.2-9) we can bound the 
error, all well and good; often, however, it will be difficult to estimate, much 
less bound, the derivative term in (3.2-9).] It is desirable then to use that 
interpolation formula which gives the best results at every stage of the 
computation. 

Consider the problem of estimating ln .65 using the data in the difference
table of Example 3.3. Suppose that a priori we do not know how many
differences will be required to obtain the accuracy we need.† If all the data in
the table will be required to achieve the desired accuracy, it makes no 
appreciable difference which finite-difference interpolation formula we use 
because all will be algebraically equivalent. But if it is possible that a 
sufficiently accurate result can be obtained using fewer than six differences, 
we should choose our interpolation formula with some care. 

Let us compare the use of Newton’s backward formula and Gauss’
forward formula. If we may need to use all the data in the table, the Newtonian
formula must use $x_0 = .9$, that is, m = -2.5, and the Gaussian formula
must use $x_0 = .6$ (m = .5). But while these two formulas will be algebraically
equivalent if terms through the sixth difference are used, for smaller numbers
of terms, they will not be equivalent {17}. Therefore, which should we
choose?

This question is most easily answered by considering the error term
(3.2-9). The only term in the error that we can control is the $p_n(x)$ term. To
minimize the magnitude of $p_n(x)$, we should choose the tabular points so that
the value of x at which we wish to interpolate is as near as possible to the
center of the interval spanned by the tabular points (why?). Therefore, the
answer to our question is that the Gaussian formula is to be preferred in
the above example because when the number of differences used is small, it
more nearly satisfies the condition above than the Newtonian formula.

From the above it follows that Newton’s backward formula has its chief
value when we wish to interpolate near the end of a table, for in this case
there would not be a sufficient number of differences available for the Gaussian
formula. For example, to estimate ln .85 using the data of Example 3.3,
if we used Gauss’ forward formula with $x_0 = .8$, we could only use the terms
through the second difference {18}. Similarly, Newton’s forward formula is
chiefly valuable near the beginning of a table. But when there are a substantial
number of tabular points available on either side of the interpolation
point, a Gaussian formula is more desirable than either Newtonian formula.
In particular, Stirling’s formula (which is just the average of two Gaussian
formulas) terminated with an even difference (so that it is equivalent to a
Lagrangian formula) is useful when m can be chosen near zero; similarly,
Bessel’s formula terminated with an odd difference is useful when m is near 1/2.
The justification for these conclusions is considered in {18}.


† For such a simple function as this, we could, of course, estimate the error using (3.2-9).


Example 3.4 Use Newton’s backward formula with $x_0 = .9$ and Gauss’ forward formula
with $x_0 = .6$ to find an estimate of ln .65.
Using (3.3-15), (3.3-16), and the data of Example 3.3, we can construct the following
table:

Number of            Newton’s backward       Gauss’ forward
differences used     formula (m = -2.5)      formula (m = .5)

       0                 -.105361                -.510826
       1                 -.399819                -.433751
       2                 -.429346                -.430229
       3                 -.430869                -.430701
       4                 -.430762                -.430821
       5                 -.430791                -.430792
       6                 -.430774                -.430775


The true value is ln .65 ≈ -.430783. As we expect, when the number of differences is
small, the Gaussian formula is more accurate than the Newtonian formula, although both
give the same value, except for roundoff, when all six differences are used. (Why is the
Newtonian formula more accurate when four differences are used?) Using (3.2-9) we can
verify that the error at every stage is within expected bounds {17}.
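
The comparison in Example 3.4 can be repeated mechanically. The sketch below is one possible arrangement, with helper names of our own choosing; it builds the difference table from six-place values of ln x and accumulates the partial sums of Newton's backward formula (origin at .90, m = -2.5) and of Gauss' forward formula (origin at .60, m = .5). Because the arithmetic here is carried to full machine precision, the printed values may differ from the hand-computed table in the last digit.

```python
import math

def gen_binom(t, k):
    """Generalized binomial coefficient (t)_k = t (t - 1) ... (t - k + 1) / k!."""
    result = 1.0
    for i in range(k):
        result *= (t - i) / (i + 1)
    return result

def difference_table(values):
    """Column k holds the k-th forward differences, top to bottom."""
    cols = [list(values)]
    while len(cols[-1]) > 1:
        cols.append([b - a for a, b in zip(cols[-1], cols[-1][1:])])
    return cols

def newton_backward_sums(cols, origin, m):
    """Partial sums of Newton's backward formula; entry k uses k differences."""
    sums = [cols[0][origin]]
    for k in range(1, origin + 1):
        sums.append(sums[-1] + gen_binom(m + k - 1, k) * cols[k][origin - k])
    return sums

def gauss_forward_sums(cols, origin, m, k_max):
    """Partial sums of Gauss' forward formula; entry k uses k differences."""
    sums = [cols[0][origin]]
    for k in range(1, k_max + 1):
        sums.append(sums[-1]
                    + gen_binom(m + (k - 1) // 2, k) * cols[k][origin - k // 2])
    return sums

xs = [i / 10 for i in range(3, 10)]                       # .30, ..., .90
cols = difference_table([round(math.log(x), 6) for x in xs])
newton = newton_backward_sums(cols, origin=6, m=-2.5)
gauss = gauss_forward_sums(cols, origin=3, m=0.5, k_max=6)
true_value = math.log(0.65)
for k, (nb, gf) in enumerate(zip(newton, gauss)):
    print(k, f"{nb:.6f}", f"{gf:.6f}")
```

With all six differences both columns are evaluations of the same sixth-degree polynomial, so any difference between them in a hand computation is attributable to roundoff alone.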


The reader may well think that any desired degree of accuracy could be
achieved merely by increasing the number of terms used in any interpolation
formula (finite difference or Lagrangian).† In fact, interpolation series
formed by letting the number of tabular points go to infinity are generally
only asymptotically convergent; i.e., as we add more points the error first
decreases and then at some point starts to increase and grow without
bound.‡ One reason for this eventual divergence of interpolation series is
connected with the fact that the nth derivative of all but some entire functions
(functions with no singularities in the complex plane) eventually grows
without bound as n increases (see Sec. 4.9). Even for entire functions,
however, the interpolation series may fail to converge {19}. We note that in
practice the desired degree of accuracy in interpolation can almost always be
achieved; i.e., the asymptotic convergence is generally very good indeed.


† We ignore here the fact that the growth of roundoff error with higher differences limits the
accuracy attainable with finite-difference interpolation formulas.

‡ The classic example of this behavior is the very well-behaved function $f(x) = 1/(1 + x^2)$,
which is considered by Steffensen (1950, pp. 35-38).

The discussion above deals with the case in which the interval spanned
by these points increases without bound as more tabular points are added.
Suppose now instead that we have a fixed finite interval [a, b] and a sequence
$\{S_n\}$ of sets of interpolation points


$S_n = \{ x_j^{(n)}, \; j = 1, \ldots, n \}$


such that each $S_n$ is contained in [a, b]. If $p_n(x)$ is then an interpolation
polynomial based on $S_n$, we would expect that if the maximum distance
between adjacent points $x_j^{(n)}$ goes to zero as n goes to $\infty$, the $p_n(x)$ would
converge uniformly to f(x) as n goes to $\infty$. However, this is not necessarily
the case, as indicated by the classical example of Runge, in which $S_n$ consists
of equidistantly spaced points in [-1, 1] and $f(x) = 1/(1 + 25x^2)$. This function
is quite smooth and is differentiable infinitely often, but the $p_n(x)$ do not
converge uniformly to f(x). This nonconvergence is called the Runge effect.
On the other hand, there are sequences $\{S_n\}$ for which the $p_n(x)$ do converge
uniformly to f(x) provided only that f(x) satisfies some quite mild conditions.
One such is the sequence of Chebyshev nodes
$S_n = \Bigl\{ -\cos \frac{(j - 1)\pi}{n - 1}, \; j = 1, \ldots, n \Bigr\}$

for which the $p_n(x)$ converge uniformly to f(x) on [-1, 1] provided that f(x)
has a bounded derivative on this interval.
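
The contrast between equidistant and Chebyshev-type points is easily observed numerically. The sketch below rests on two assumptions that should be stated plainly: Runge's function is taken in its common normalization $1/(1 + 25x^2)$ on [-1, 1], and the Chebyshev points are taken to be the extreme points $-\cos((j - 1)\pi/(n - 1))$. All function names are our own. The maximum error is estimated on a fine grid, so the printed figures are only approximations to the true maxima.

```python
import math

def lagrange_eval(nodes, values, x):
    """Evaluate the Lagrangian interpolation polynomial through (nodes, values) at x."""
    total = 0.0
    for j, (aj, fj) in enumerate(zip(nodes, values)):
        w = 1.0
        for k, ak in enumerate(nodes):
            if k != j:
                w *= (x - ak) / (aj - ak)
        total += w * fj
    return total

def runge(x):
    # assumed normalization of Runge's function on [-1, 1]
    return 1.0 / (1.0 + 25.0 * x * x)

def equidistant(n):
    return [-1.0 + 2.0 * (j - 1) / (n - 1) for j in range(1, n + 1)]

def chebyshev(n):
    # assumed form of the Chebyshev points: extreme points on [-1, 1]
    return [-math.cos((j - 1) * math.pi / (n - 1)) for j in range(1, n + 1)]

def max_error(nodes, samples=401):
    """Largest interpolation error found on an equally spaced sampling grid."""
    values = [runge(a) for a in nodes]
    grid = [-1.0 + 2.0 * i / (samples - 1) for i in range(samples)]
    return max(abs(lagrange_eval(nodes, values, x) - runge(x)) for x in grid)

e11, e21 = max_error(equidistant(11)), max_error(equidistant(21))
c11, c21 = max_error(chebyshev(11)), max_error(chebyshev(21))
print(e11, e21, c11, c21)
```

For the equidistant points the error grows as n increases, with the largest errors near the ends of the interval, whereas for the Chebyshev-type points it decreases.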


3.5 ITERATED INTERPOLATION 


An important advantage of finite-difference interpolation formulas over the 
Lagrangian formula would seem to be the property of the former that 
enables a term to be added to them merely by adding one tabular point and 
computing an additional row of differences. As we demonstrated in Example 
3.4, this enables us to generate a sequence of interpolants each one involving 
one more tabular point than the previous one. Therefore, the convergence of 
the interpolation procedure can be tested easily. But suppose, given the 
Lagrangian interpolation formula using n points, we wish to add one point 
to get higher accuracy. A look at (3.2-4) indicates that even if we have saved
the values of $p_n'(a_j)$, j = 1, ..., n, each $l_j(x)$, j = 1, ..., n, requires some
recalculation and we must also calculate $l_{n+1}(x)$. Our purpose in this section is to
show how this seeming disadvantage of the Lagrangian formula can be
overcome. We shall do this by using iterated interpolation, in which a sequence
of interpolants in the Lagrangian context is generated without the
need for substantial recalculation of coefficients when going from n to n + 1
points.

Denote by $y_{i_1, \ldots, i_k}(x)$ the Lagrangian interpolation formula using the
points $a_{i_1}, \ldots, a_{i_k}$, which we do not require to be equally spaced. Then in
particular we can write

$y_{1, 2, \ldots, n}(x) = \frac{1}{a_n - a_{n-1}} \begin{vmatrix} y_{1, 2, \ldots, n-1}(x) & a_{n-1} - x \\ y_{1, 2, \ldots, n-2, n}(x) & a_n - x \end{vmatrix}$   (3.5-1)

This equation can be verified by noting that the right-hand side, which is a
polynomial of degree n - 1, takes on the values $f(a_i)$ at the points $a_i$, i = 1,
..., n. Equation (3.5-1) then indicates how a Lagrangian formula of order n
can be generated from lower-order formulas. By use of the following table,
we can generalize the result of (3.5-1) to achieve our object [note that
$y_i(x) = f(a_i)$]:

   a_1    a_1 - x    y_1(x)
   a_2    a_2 - x    y_2(x)    y_{1,2}(x)
   a_3    a_3 - x    y_3(x)    y_{1,3}(x)    y_{1,2,3}(x)                          (3.5-2)
   ...
   a_n    a_n - x    y_n(x)    y_{1,n}(x)    y_{1,2,n}(x)   ...   y_{1,2,3,...,n}(x)


The entries in each column of the table can be generated from the entries in
the previous column by analogy with (3.5-1). For example

$y_{1, 2, n}(x) = \frac{1}{a_n - a_2} \begin{vmatrix} y_{1, 2}(x) & a_2 - x \\ y_{1, n}(x) & a_n - x \end{vmatrix}$   (3.5-3)


The entries on the diagonal in (3.5-2) are just what we were seeking. They
form a sequence of Lagrangian interpolants each of which incorporates one
more tabular point than the previous one. Further, since each entry in (3.5-2)
is calculated using a formula analogous to (3.5-1), the process is easily
mechanized. Iterated interpolation is thus well suited to digital-computer
application and, for points not equally spaced, is also convenient for hand
computation.


Example 3.5 Use iterated interpolation to calculate ln .54 using the data of Example 3.2.
Corresponding to the table (3.5-2), we get


   a_i    a_i - x     y_i(x)

   .40     -.14     -.916291
   .50     -.04     -.693147    -.603889
   .60      .06     -.510826    -.632466    -.615320
   .70      .16     -.356675    -.655137    -.614139    -.616029
   .80      .26     -.223144    -.673690    -.613196    -.615957    -.616144


We are not bound to use the natural order of the points as above. Consider instead
iterated interpolation using the ordering below:


   a_i    a_i - x     y_i(x)

   .50     -.04     -.693147
   .60      .06     -.510826    -.620219
   .40     -.14     -.916291    -.603889    -.615320
   .70      .16     -.356675    -.625853    -.616839    -.616029
   .80      .26     -.223144    -.630480    -.617141    -.615957    -.616144


The difference in the two final values from the result of Example 3.2 is the result of
roundoff since the same five points were used in all computations. Note, however, that the
first interpolant (-.620219) in the second calculation is substantially more accurate than
that in the first calculation (-.603889). This occurs because in the second computation we
arranged the data so that the magnitudes in the $a_i - x$ column would be increasing. In this
way the value of $p_n(x)$ in the error term is minimized at every stage (cf. the discussion of
Sec. 3.4). Therefore, if we order the tabular points so that the magnitudes in the $a_i - x$
column are increasing, each interpolant tends to be the best possible [“tends” because the
value of the derivative in the error may be greater when $p_n(x)$ is smaller]. In this way the
convergence of the interpolation (as judged by the difference in two successive interpolants
or by the stabilization of a certain number of decimal places) will tend to be most rapid. In
this example only the first interpolant is improved because only the tabular point at .40 is
out of the best possible order.
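
The scheme of Example 3.5 is short enough to mechanize directly. In the sketch below (the name `iterated_interpolation` is our own), each new column of the table is formed from the previous one by the two-by-two determinant rule, and only the diagonal of successive interpolants is returned. The data are the six-place values used in the example, in both orderings.

```python
def iterated_interpolation(a, f, x):
    """Return the diagonal of the iterated-interpolation table: entry k is the
    interpolant through the first k + 1 points of a, evaluated at x."""
    n = len(a)
    column = list(f)                 # first column: y_i(x) = f(a_i)
    diagonal = [column[0]]
    for k in range(1, n):
        new = list(column)
        for i in range(k, n):
            # determinant rule: combine the entry on the previous diagonal
            # (points a[0..k-1]) with the entry in row i
            new[i] = (column[k - 1] * (a[i] - x)
                      - column[i] * (a[k - 1] - x)) / (a[i] - a[k - 1])
        column = new
        diagonal.append(column[k])
    return diagonal

ln = {0.40: -0.916291, 0.50: -0.693147, 0.60: -0.510826,
      0.70: -0.356675, 0.80: -0.223144}
natural = [0.40, 0.50, 0.60, 0.70, 0.80]
reordered = [0.50, 0.60, 0.40, 0.70, 0.80]     # increasing |a_i - x| for x = .54
d1 = iterated_interpolation(natural, [ln[a] for a in natural], 0.54)
d2 = iterated_interpolation(reordered, [ln[a] for a in reordered], 0.54)
print([round(v, 6) for v in d1])
print([round(v, 6) for v in d2])
```

Only one column need be stored at a time, which is the reason the process requires so little recalculation when a point is added.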


3.6 INVERSE INTERPOLATION 


In Chap. 8 we shall be concerned with the solution of the general nonlinear 
equation f(x) = 0. One of our basic tools in the solution of this equation will 
be inverse interpolation, which we now consider briefly. The solution of 
f(x) =0 is one example of the common numerical problem of finding the 
zero of a function. Another case where this occurs is in the numerical inte- 
gration of an ordinary differential equation (see Chap. 5), when we would 
like to know that value of the independent variable for which the dependent 
variable, i.e., the solution of the differential equation, is zero. Inverse interpo- 
lation provides us with a straightforward and powerful way to find such 
zeros of functions. 

Let the function whose zero (or zeros) we wish to find be y = f(x) and
suppose it is tabulated at a series of points (which need not necessarily be
equally spaced), so that we have


   x       x_1       x_2      ...    x_n
   f(x)    f(x_1)    f(x_2)   ...    f(x_n)          (3.6-1)


Now let us suppose that on the interval $[x_1, x_n]$, f(x) satisfies the conditions
of the inverse-function theorem, i.e., in particular that $f'(x) \neq 0$, so that we
can write x = g(y), where g is the function inverse to f. Therefore, finding the
value of g(0) is equivalent to finding a zero of f(x). To estimate g(0) we first
write the table (3.6-1) as


   y       f(x_1)    f(x_2)   ...    f(x_n)
   g(y)    x_1       x_2      ...    x_n             (3.6-2)


Now in the context of interpolation let $f(x_1), \ldots, f(x_n)$ be the tabular points
of the independent variable y (not equally spaced in general) and let $x_1, \ldots, x_n$
be the functional values at these points. Then, if we use a Lagrangian
interpolation formula to approximate g(y) by a polynomial and then interpolate
at the point y = 0, we get the desired approximation to $\alpha = g(0)$.


† We note in passing that even the Newton-Raphson method for the solution of f(x) = 0
can be considered to be an application of inverse interpolation; see Chap. 8.


Example 3.6 Given the data


   x        .1        .2        .3        .4        .5
   f(x)    .70010    .40160    .10810   -.17440   -.43750


find an approximate value of the zero of f(x) between .3 and .4.

Our approach will be to use iterated interpolation. We therefore first arrange the
data in order of increasing magnitude of f(x) (cf. Example 3.5) and then use the technique
of the previous section to generate the table


   f(x_i)     f(x_i) - 0    x_i

   .10810      .10810       .3
  -.17440     -.17440       .4    .33827
   .40160      .40160       .2    .33683    .33783
  -.43750     -.43750       .5    .33963    .33737    .33761
   .70010      .70010       .1    .33652    .33792    .33771    .33765


The data for f(x) are in fact values of the function $x^4 - 3x + 1$, which has a zero .33767
correctly rounded to five places.
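
Inverse interpolation requires no new machinery: the iterated scheme of Sec. 3.5 is applied with the roles of x and f(x) exchanged and with the interpolation point y = 0. The sketch below (function names are our own) repeats Example 3.6 on the five tabulated values, ordered by increasing magnitude of f(x).

```python
def iterated_interpolation(a, f, x):
    """Diagonal of the iterated-interpolation table for abscissas a, values f, at x."""
    column = list(f)
    diagonal = [column[0]]
    for k in range(1, len(a)):
        new = list(column)
        for i in range(k, len(a)):
            new[i] = (column[k - 1] * (a[i] - x)
                      - column[i] * (a[k - 1] - x)) / (a[i] - a[k - 1])
        column = new
        diagonal.append(column[k])
    return diagonal

def inverse_interpolate(xs, ys, target=0.0):
    """Estimate the x at which the tabulated function equals target by treating
    the function values ys as abscissas and the xs as functional values."""
    return iterated_interpolation(ys, xs, target)

# data of Example 3.6, ordered by increasing |f(x)|
xs = [0.3, 0.4, 0.2, 0.5, 0.1]
ys = [0.10810, -0.17440, 0.40160, -0.43750, 0.70010]
estimates = inverse_interpolate(xs, ys)
print([round(v, 5) for v in estimates])
```

The successive estimates are the diagonal entries of the table in the example; intermediate values may differ in the fifth place from a hand computation that rounds at each step.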


Expressed in terms of g(y) the error in inverse Lagrangian interpolation
is


$E(y) = \frac{p_n(y)}{n!} g^{(n)}(\xi)$   (3.6-3)

Derivatives of g can be expressed in terms of f and although this relation is
not simple (see {1}, Chap. 8), there is a power of $f'(x)$ in the denominator of
each derivative of g(y), for example,


$g'(y) = \frac{1}{f'(x)} \qquad g''(y) = -\frac{f''(x)}{[f'(x)]^3}$

Therefore, although we can carry through the process of inverse interpolation
even if $f'(x)$ vanishes in $[x_1, x_n]$, we would expect the accuracy to be
very poor in this case. Indeed, at a point at which $f'(x)$ vanishes, the inverse
function may not exist. When $f'(x)$ vanishes near the zero, however, we can
often find the zero by an iterative process involving linear inverse interpolation
{27}. In general, linear inverse interpolation may even be preferable to
higher-order inverse interpolation, which can be very problematical, since the
quality of the interpolation depends on how well the inverse function can be
approximated by a polynomial, but we usually have this information only
for the original function and not for its inverse.
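
One common form of an iteration based on linear inverse interpolation is sketched below. It is an assumption on our part that this is the kind of process intended; the routine simply interpolates linearly in the inverse table through the two most recent points and evaluates at y = 0, which is the familiar secant iteration. The function name, the stopping tolerance, and the test function $x^4 - 3x + 1$ (borrowed from Example 3.6) are illustrative choices only.

```python
def linear_inverse_iteration(f, x0, x1, tol=1e-12, max_iter=50):
    """Repeated linear inverse interpolation through the two latest points.
    Returns the last iterate; convergence is not guaranteed in general."""
    f0, f1 = f(x0), f(x1)
    for _ in range(max_iter):
        if f1 == f0:                 # the inverse table has a repeated abscissa
            break
        # value at y = 0 of the line through (f0, x0) and (f1, x1)
        x2 = (x0 * f1 - x1 * f0) / (f1 - f0)
        if abs(x2 - x1) < tol:
            return x2
        x0, f0, x1, f1 = x1, f1, x2, f(x2)
    return x1

def quartic(x):
    # function underlying the data of Example 3.6
    return x ** 4 - 3.0 * x + 1.0

root = linear_inverse_iteration(quartic, 0.3, 0.4)
print(round(root, 5))
```

Each step uses only two points of the inverse table, so no assumption is made about how well the inverse function is approximated by a polynomial of higher degree.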


3.7 HERMITE INTERPOLATION 


In this section we consider the case $m_j = 1$ in (3.1-2) for j = 1, ..., r, that is, we
suppose that the first derivative as well as the function is known at r of the n
tabular points. In place of (3.1-1) we have then


$f(x) = \sum_{j=1}^{n} h_j(x) f(a_j) + \sum_{j=1}^{r} \bar{h}_j(x) f'(a_j) + E(x) = y(x) + E(x)$   (3.7-1)

where now the approximation y(x) is given by

$y(x) = \sum_{j=1}^{n} h_j(x) f(a_j) + \sum_{j=1}^{r} \bar{h}_j(x) f'(a_j)$   (3.7-2)


and $h_j(x)$ and $\bar{h}_j(x)$ are both polynomials. Again using the criterion of exact
approximation, we require that the error term E(x) be such that

$E(a_j) = 0 \qquad j = 1, \ldots, n$
$E'(a_j) = 0 \qquad j = 1, \ldots, r$   (3.7-3)


In analogy with Eq. (3.2-1) in the Lagrangian case, this leads to the following
conditions that must be satisfied by the $h_j(x)$ and $\bar{h}_j(x)$:


$h_j(a_k) = \delta_{jk} \qquad j, k = 1, \ldots, n$
$\bar{h}_j(a_k) = 0 \qquad j = 1, \ldots, r; \; k = 1, \ldots, n$
$h_j'(a_k) = 0 \qquad j = 1, \ldots, n; \; k = 1, \ldots, r$
$\bar{h}_j'(a_k) = \delta_{jk} \qquad j, k = 1, \ldots, r$   (3.7-4)


Since there are n + r conditions to satisfy in (3.7-3), we expect that y(x) will
have to be a polynomial of degree n + r - 1; that is, we shall approximate
f(x) by a polynomial of degree n + r - 1 passing through $f(a_j)$, j = 1, ..., n,
and having derivatives $f'(a_j)$, j = 1, ..., r. In deriving the $h_j(x)$ and $\bar{h}_j(x)$, we
shall use the notation

$p_n(x) = (x - a_1) \cdots (x - a_n)$

$p_r(x) = (x - a_1) \cdots (x - a_r)$

$l_{jn}(x) = \frac{p_n(x)}{(x - a_j) p_n'(a_j)} \qquad j = 1, \ldots, n$

$l_{jr}(x) = \frac{p_r(x)}{(x - a_j) p_r'(a_j)} \qquad j = 1, \ldots, r$   (3.7-5)

To satisfy the conditions on $h_j(x)$ we set

$h_j(x) = t_j(x) l_{jn}(x) l_{jr}(x) \qquad j = 1, \ldots, r$
$h_j(x) = l_{jn}(x) \frac{p_r(x)}{p_r(a_j)} \qquad j = r + 1, \ldots, n$   (3.7-6)


where $t_j(x)$ is a linear polynomial so that $h_j(x)$ is of degree n + r - 1. As
given by (3.7-6), $h_j(x)$ satisfies all the conditions of (3.7-4) except $h_j(a_j) = 1$,
j = 1, ..., r, and $h_j'(a_j) = 0$, j = 1, ..., r. To satisfy these we must have


$t_j(a_j) = 1$
$t_j'(a_j) + l_{jn}'(a_j) + l_{jr}'(a_j) = 0 \qquad j = 1, \ldots, r$   (3.7-7)

Similarly, if we set

$\bar{h}_j(x) = s_j(x) l_{jr}(x) l_{jn}(x) \qquad j = 1, \ldots, r$   (3.7-8)

with $s_j(x)$ a linear polynomial, we must have

$s_j(a_j) = 0$
$s_j'(a_j) = 1 \qquad j = 1, \ldots, r$   (3.7-9)


in order to satisfy (3.7-4). Linear functions satisfying (3.7-7) and (3.7-9) are
easily found to be {28}


$t_j(x) = 1 - (x - a_j)[l_{jn}'(a_j) + l_{jr}'(a_j)] \qquad s_j(x) = x - a_j$   (3.7-10)


This completes the determination of $h_j(x)$ and $\bar{h}_j(x)$.
To find E(x) we proceed in a manner similar to the Lagrangian case. Let


$F(z) = f(z) - y(z) - [f(x) - y(x)] \frac{p_n(z) p_r(z)}{p_n(x) p_r(x)}$   (3.7-11)

with x not one of the tabular points. This function has n + r + 1 zeros
(double zeros at $a_1, \ldots, a_r$, single zeros at $a_{r+1}, \ldots, a_n$ and x) so that by a
generalization of Rolle’s theorem, there exists a $\xi$ in the interval spanned
by $a_1, \ldots, a_n$ and x such that

$0 = F^{(n+r)}(\xi) = f^{(n+r)}(\xi) - [f(x) - y(x)] \frac{(n + r)!}{p_n(x) p_r(x)}$   (3.7-12)

Thus

$E(x) = \frac{p_n(x) p_r(x)}{(n + r)!} f^{(n+r)}(\xi)$   (3.7-13)


This relation also is correct if x is one of the tabular points (why?).
The interpolation formula (3.7-1) then becomes


$f(x) = \sum_{j=1}^{n} h_j(x) f(a_j) + \sum_{j=1}^{r} \bar{h}_j(x) f'(a_j) + \frac{p_n(x) p_r(x)}{(n + r)!} f^{(n+r)}(\xi)$   (3.7-14)

with

$h_j(x) = \{ 1 - (x - a_j)[l_{jn}'(a_j) + l_{jr}'(a_j)] \} l_{jn}(x) l_{jr}(x) \qquad j = 1, \ldots, r$
$h_j(x) = l_{jn}(x) \frac{p_r(x)}{p_r(a_j)} \qquad j = r + 1, \ldots, n$   (3.7-15)

$\bar{h}_j(x) = (x - a_j) l_{jr}(x) l_{jn}(x) \qquad j = 1, \ldots, r$   (3.7-16)


and is called the modified Hermite interpolation formula. When r = n, the
formula is


$f(x) = \sum_{j=1}^{n} h_j(x) f(a_j) + \sum_{j=1}^{n} \bar{h}_j(x) f'(a_j) + \frac{p_n^2(x)}{(2n)!} f^{(2n)}(\xi)$   (3.7-17)

with

$h_j(x) = [1 - 2(x - a_j) l_j'(a_j)] l_j^2(x)$
$\bar{h}_j(x) = (x - a_j) l_j^2(x)$   (3.7-18)


where we have replaced $l_{jn}(x)$ by $l_j(x)$. Equation (3.7-17) is the Hermite
interpolation formula, also called the formula for osculatory interpolation.
Both the Hermite and modified Hermite formulas can be useful interpo- 
lation formulas. They also serve as useful theoretical tools in other areas of 
numerical analysis, as we shall see in Chaps. 4, 5, and 8. 


Example 3.7 Given the table below of the natural logarithm and its derivative

    x        ln x        1/x

   .40     -.916291     2.50
   .50     -.693147     2.00
   .70     -.356675     1.43
   .80     -.223144     1.25

estimate the value of ln .60 using the Hermite interpolation formula.
From (3.7-18) we get

$$h_1(.60) = \tfrac{11}{54} \qquad h_2(.60) = \tfrac{8}{27} \qquad h_3(.60) = \tfrac{8}{27} \qquad h_4(.60) = \tfrac{11}{54}$$
$$\bar{h}_1(.60) = \tfrac{1}{180} \qquad \bar{h}_2(.60) = \tfrac{2}{45} \qquad \bar{h}_3(.60) = -\tfrac{2}{45} \qquad \bar{h}_4(.60) = -\tfrac{1}{180}$$

and from (3.7-17)

$$\ln .60 = -.510824$$

whereas the true value is $-.510826$. Using (3.7-13), the error is bounded by

$$-.000031 \approx -\frac{1}{32768} < E(.60) < -\frac{1}{2^{23}} \approx -.0000001$$


so that the excellent agreement between the interpolated and true values is to be expected. 
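
The computation of Example 3.7 can be retraced with a short program. The following is a minimal sketch in Python and is not part of the original text. The function name `hermite_interpolate` is our own, and we assume that the derivative values $1/x$ enter unrounded rather than as tabulated to two decimals, since that appears to be what reproduces the printed result.

```python
import math

def hermite_interpolate(nodes, values, derivs, x):
    """Evaluate the Hermite interpolant of (3.7-17), i.e. the case r = n, at x."""
    total = 0.0
    for j, aj in enumerate(nodes):
        others = [ak for k, ak in enumerate(nodes) if k != j]
        # Lagrange basis polynomial l_j(x) and its derivative l_j'(a_j) at its own node
        lj = math.prod((x - ak) / (aj - ak) for ak in others)
        dlj_at_aj = sum(1.0 / (aj - ak) for ak in others)
        h = (1.0 - 2.0 * (x - aj) * dlj_at_aj) * lj ** 2   # h_j(x) of (3.7-18)
        h_bar = (x - aj) * lj ** 2                         # h-bar_j(x) of (3.7-18)
        total += h * values[j] + h_bar * derivs[j]
    return total

nodes = [0.40, 0.50, 0.70, 0.80]
values = [-0.916291, -0.693147, -0.356675, -0.223144]   # ln x as tabulated
derivs = [1.0 / a for a in nodes]                        # assumption: unrounded 1/x

estimate = hermite_interpolate(nodes, values, derivs, 0.60)
print(round(estimate, 6), round(math.log(0.60), 6))
```

With these inputs the sketch should return a value close to the $-.510824$ quoted above, and the difference from `math.log(0.60)` should fall inside the error bounds of the example.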


3.8 SPLINE INTERPOLATION 


As mentioned in Sec. 3.4, it is possible that a sequence of interpolation 
polynomials {p,(x)} over a fixed finite interval need not converge even to a 
smooth function. And if we investigate the behavior of these polynomials 
between the interpolation points we find that in many cases the polynomials 
oscillate quite violently while the function varies smoothly. And the higher 
the degree of the polynomial, the worse the situation becomes. A way to 
overcome this problem is by using piecewise low-order interpolating poly- 
nomials on subintervals of the given interval. With them, the oscillation 
between points is not significant, so that they can imitate the behavior of the 
function. However, the resulting function pieced together from the indivi- 
dual low-order polynomials may not be smooth. Since we wish to imitate the 
behavior of smooth functions, a requirement for these piecewise functions is 
that the resulting function pieced together be smooth. Such functions, with 
the maximal degree of smoothness, are called splines and we shall now 
describe them formally. 
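
The contrast described above can be observed numerically. The sketch below, in Python, is illustrative only: the test function $1/(1 + 25x^2)$ on $[-1, 1]$ is a standard example that we have chosen ourselves (it is not taken from this section), the nodes are equally spaced, and all function names are hypothetical.

```python
def runge(x):
    """Smooth test function; an assumption of this sketch, not taken from the text."""
    return 1.0 / (1.0 + 25.0 * x * x)

def lagrange_eval(nodes, values, x):
    """Single interpolating polynomial through all nodes, in Lagrangian form."""
    total = 0.0
    for j, aj in enumerate(nodes):
        term = values[j]
        for k, ak in enumerate(nodes):
            if k != j:
                term *= (x - ak) / (aj - ak)
        total += term
    return total

def piecewise_linear_eval(nodes, values, x):
    """Degree-1 polynomial on each subinterval between neighboring nodes."""
    for i in range(1, len(nodes)):
        if x <= nodes[i]:
            t = (x - nodes[i - 1]) / (nodes[i] - nodes[i - 1])
            return (1.0 - t) * values[i - 1] + t * values[i]
    return values[-1]

def max_errors(n, samples=2001):
    """Largest sampled error of both interpolants using n equally spaced nodes."""
    nodes = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    values = [runge(a) for a in nodes]
    grid = [-1.0 + 2.0 * i / (samples - 1) for i in range(samples)]
    err_poly = max(abs(runge(x) - lagrange_eval(nodes, values, x)) for x in grid)
    err_lin = max(abs(runge(x) - piecewise_linear_eval(nodes, values, x)) for x in grid)
    return err_poly, err_lin

results = {n: max_errors(n) for n in (5, 11, 21)}
for n, (err_poly, err_lin) in results.items():
    print(n, err_poly, err_lin)
```

For this function the sampled error of the single polynomial grows as nodes are added, while the piecewise linear error shrinks. The piecewise linear interpolant is of course not smooth at the nodes, which is the defect that splines are meant to remove.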

Let the interval $I = [a, b]$ be divided into $n - 1$ subintervals
$a = a_1 < a_2 < \cdots < a_{n-1} < a_n = b$, not necessarily of equal length. A spline $S(x)$ of
degree $m$ is a function defined on $I$ which:


1. Coincides with a polynomial of degree $m$ on each subinterval
   $I_i = [a_{i-1}, a_i]$, $i = 2, \ldots, n$
2. Has continuous derivatives up to order $m - 1$.


The abscissas $\{a_i\}$ are called the nodes or knots of the spline. A spline $S(x)$ is
said to interpolate to the data points $(a_i, y_i)$ if $S(a_i) = y_i$, $i = 1, \ldots, n$.

The word spline derives from the instrument often used by draftsmen in 
fairing a curve through data points. The simplest spline, that of degree 1, is a
piecewise linear function which is not very smooth but very useful if the 
spacing between nodes is small. In fact, every table of functional values in 
which linear interpolation is used leads to an approximation of the underly- 
ing function by a linear spline. Splines of degree 2 can be defined, but since 
there is only one degree of freedom in their definition, there is a lack of 
symmetry in their determination with relation to the endpoints of the inter- 
val. Furthermore, the resulting functions are not sufficiently smooth. Thus 
the most prevalent spline in use is the cubic spline, which involves two
parameters chosen to reflect the behavior at the endpoints of the interval. 
One type of end condition is that S"(a) = S"(b) = 0. The cubic spline which 
satisfies these conditions is called the natural cubic spline. A second condi- 
tion, which generally yields better results, is 


$$S'(a) = f'(a) \qquad S'(b) = f'(b)$$


This, of course, requires knowledge of the derivative at the endpoints. 

Since we shall subsequently be dealing exclusively with cubic interpola- 
tory splines, we shall drop the adjectives and simply write spline. One of 
several representations of splines can be derived as follows. Set 


$$h_i = a_i - a_{i-1} \qquad i = 2, 3, \ldots, n \tag{3.8-1}$$


Since S(x) is piecewise cubic, S’(x) is piecewise quadratic and S”(x) is piece- 
wise linear and continuous. Hence, we can write 


$$S''(x) = M_{i-1}\,\frac{a_i - x}{h_i} + M_i\,\frac{x - a_{i-1}}{h_i} \qquad \text{on } I_i \tag{3.8-2}$$

for certain constants $M_i$, where in fact

$$S''(a_i) = M_i \qquad i = 1, 2, \ldots, n \tag{3.8-3}$$


Integrating (3.8-2) twice and writing the arbitrary linear function in the form 
indicated, we obtain 


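$$S(x) = M_{i-1}\,\frac{(a_i - x)^3}{6h_i} + M_i\,\frac{(x - a_{i-1})^3}{6h_i} + c_i (a_i - x) + d_i (x - a_{i-1}) \tag{3.8-4}$$

on $I_i$. Since we wish the spline to interpolate at the knots, we have that
$S(a_{i-1}) = y_{i-1}$ and $S(a_i) = y_i$. This determines the $c_i$ and $d_i$, yielding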


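$$S(x) = M_{i-1}\,\frac{(a_i - x)^3}{6h_i} + M_i\,\frac{(x - a_{i-1})^3}{6h_i} + \left(y_{i-1} - \frac{M_{i-1} h_i^2}{6}\right)\frac{a_i - x}{h_i} + \left(y_i - \frac{M_i h_i^2}{6}\right)\frac{x - a_{i-1}}{h_i} \tag{3.8-5}$$

on $I_i$. Differentiating (3.8-5), we obtain

$$S'(x) = -M_{i-1}\,\frac{(a_i - x)^2}{2h_i} + M_i\,\frac{(x - a_{i-1})^2}{2h_i} + \frac{y_i - y_{i-1}}{h_i} - \frac{M_i - M_{i-1}}{6}\,h_i \tag{3.8-6}$$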
on $I_i$. We note the particular values

$$S'(a_i^-) = \frac{h_i}{6} M_{i-1} + \frac{h_i}{3} M_i + \frac{y_i - y_{i-1}}{h_i}$$
$$S'(a_i^+) = -\frac{h_{i+1}}{3} M_i - \frac{h_{i+1}}{6} M_{i+1} + \frac{y_{i+1} - y_i}{h_{i+1}} \tag{3.8-7}$$

Since $S'(x)$ is required to be continuous, these two values must be equal;
this yields the equations

$$\frac{h_i}{6} M_{i-1} + \frac{h_i + h_{i+1}}{3} M_i + \frac{h_{i+1}}{6} M_{i+1} = \frac{y_{i+1} - y_i}{h_{i+1}} - \frac{y_i - y_{i-1}}{h_i} \qquad i = 2, 3, \ldots, n - 1 \tag{3.8-8}$$


These form a set of $n - 2$ linear equations in $M_1, \ldots, M_n$, so that two more
conditions must be added. Once the $M$'s are determined, the interpolation
spline is completely determined through (3.8-5). We shall abbreviate
Eqs. (3.8-8) by setting

$$\sigma_i = \frac{y_i - y_{i-1}}{h_i} \qquad i = 2, 3, \ldots, n$$
$$\lambda_i = \frac{h_{i+1}}{h_i + h_{i+1}} \qquad \mu_i = 1 - \lambda_i \qquad d_i = \frac{6(\sigma_{i+1} - \sigma_i)}{h_i + h_{i+1}} \qquad i = 2, 3, \ldots, n - 1 \tag{3.8-9}$$


We then get the set of equations 
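$$\mu_i M_{i-1} + 2M_i + \lambda_i M_{i+1} = d_i \qquad i = 2, 3, \ldots, n - 1 \tag{3.8-10}$$

We shall write the two additional conditions in the form

$$2M_1 + \lambda_1 M_2 = d_1 \qquad \mu_n M_{n-1} + 2M_n = d_n \tag{3.8-11}$$

and indicate below several possible choices of the constants $\lambda_1$, $d_1$, $\mu_n$, $d_n$.
The combined system now becomes

$$\begin{bmatrix} 2 & \lambda_1 & & & & \\ \mu_2 & 2 & \lambda_2 & & & \\ & \mu_3 & 2 & \lambda_3 & & \\ & & \ddots & \ddots & \ddots & \\ & & & \mu_{n-1} & 2 & \lambda_{n-1} \\ & & & & \mu_n & 2 \end{bmatrix} \begin{bmatrix} M_1 \\ M_2 \\ M_3 \\ \vdots \\ M_{n-1} \\ M_n \end{bmatrix} = \begin{bmatrix} d_1 \\ d_2 \\ d_3 \\ \vdots \\ d_{n-1} \\ d_n \end{bmatrix} \tag{3.8-12}$$
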

with a tridiagonal matrix. 

1. If one selects $\lambda_1 = d_1 = \mu_n = d_n = 0$, then $M_1 = M_n = 0$. This yields the
   natural spline as defined above.
2. Using the second endpoint condition proposed above

$$S'(a) = y'_1 \qquad S'(b) = y'_n \tag{3.8-13}$$

   we obtain

$$\lambda_1 = 1 = \mu_n \qquad d_1 = \frac{6}{h_2}\left(\frac{y_2 - y_1}{h_2} - y'_1\right) \qquad d_n = \frac{6}{h_n}\left(y'_n - \frac{y_n - y_{n-1}}{h_n}\right) \tag{3.8-14}$$

3. A third possibility is to choose $a_i$, $i = 2, \ldots, n - 1$, as the spline nodes, i.e.,
   not require the endpoints to be nodes, and to impose the conditions
   $S(a_1) = y_1$ and $S(a_n) = y_n$.


In this case, the index $i$ in (3.8-10) runs from 3 to $n - 2$, and the two
additional conditions are similar in form to those of (3.8-11). The values of
$\lambda_2$ and $d_2$ can readily be evaluated by letting $x = a_1$ in (3.8-5) and setting
$S(a_1) = y_1$. Similarly, by setting $x = a_n$, we can find the values of $\mu_{n-1}$ and
$d_{n-1}$ {30}. This scheme, which does not require any additional information,
appears to be better than choice 1 for the average function that comes up in
practice since the second derivative does not generally vanish at the
endpoints of the interval.

From (3.8-9) we see that $0 < \lambda_i < 1$ and $0 < \mu_i < 1$ for $i = 2, \ldots, n - 1$.
If, therefore, $|\lambda_1| < 2$, $|\mu_n| < 2$, the matrix in (3.8-12) will be diagonally
dominant. In this case, it can be shown (see Sec. 9.1) that unique solutions to
(3.8-12) will exist for arbitrary $d_1, \ldots, d_n$. Thus, in cases 1 and 2, we are
assured of the existence of the spline. Similarly, in case 3, a solution always
exists {30}.
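
As an illustration of how the system (3.8-12) might be assembled and solved, here is a minimal sketch in Python covering choices 1 and 2 only. The function name `spline_moments` and its arguments are our own. The elimination used is the usual forward-elimination and back-substitution scheme for tridiagonal systems; we have not checked it against the algorithm of Sec. 9.11. All indices are shifted to start at 0.

```python
def spline_moments(a, y, end_slopes=None):
    """Return [M_1, ..., M_n] solving (3.8-12); lists are indexed from 0.

    end_slopes=None      -> natural spline (choice 1: M_1 = M_n = 0)
    end_slopes=(s1, sn)  -> S'(a) = s1, S'(b) = sn (choice 2)
    """
    n = len(a)
    # h[i] = a[i] - a[i-1] as in (3.8-1) and sigma[i] as in (3.8-9); entry 0 is unused
    h = [0.0] + [a[i] - a[i - 1] for i in range(1, n)]
    sigma = [0.0] + [(y[i] - y[i - 1]) / h[i] for i in range(1, n)]

    mu, lam, d = [0.0] * n, [0.0] * n, [0.0] * n
    for i in range(1, n - 1):                      # interior rows, (3.8-9)
        width = h[i] + h[i + 1]
        lam[i] = h[i + 1] / width
        mu[i] = 1.0 - lam[i]
        d[i] = 6.0 * (sigma[i + 1] - sigma[i]) / width
    if end_slopes is not None:                     # end rows, (3.8-14)
        s1, sn = end_slopes
        lam[0] = mu[n - 1] = 1.0
        d[0] = 6.0 / h[1] * (sigma[1] - s1)
        d[n - 1] = 6.0 / h[n - 1] * (sn - sigma[n - 1])

    # Forward elimination; every diagonal entry of (3.8-12) equals 2.
    c_star, d_star = [0.0] * n, [0.0] * n
    c_star[0], d_star[0] = lam[0] / 2.0, d[0] / 2.0
    for i in range(1, n):
        pivot = 2.0 - mu[i] * c_star[i - 1]
        c_star[i] = lam[i] / pivot
        d_star[i] = (d[i] - mu[i] * d_star[i - 1]) / pivot

    # Back substitution
    m = [0.0] * n
    m[n - 1] = d_star[n - 1]
    for i in range(n - 2, -1, -1):
        m[i] = d_star[i] - c_star[i] * m[i + 1]
    return m


# Check on f(x) = x^3: with the exact end slopes the spline is the cubic itself,
# so the moments should equal f''(a_i) = 6 a_i.
nodes = [0.0, 0.5, 1.5, 2.0]
cubic = [t ** 3 for t in nodes]
clamped = spline_moments(nodes, cubic, end_slopes=(0.0, 12.0))
natural = spline_moments(nodes, cubic)
print(clamped)
print(natural)
```

Because the matrix is diagonally dominant in both cases, the pivots in the forward sweep stay away from zero, so no row interchanges are attempted here.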

Splines have the following important properties which can be readily 
proved {31, 32}. 


1. Given data points $(a_i, y_i)$, $i = 1, \ldots, n$. Of all the functions $f(x)$ with
   continuous second derivatives which interpolate to these data, the spline
   $S(x)$ which also satisfies $S''(a) = S''(b) = 0$ uniquely minimizes the integral

$$I(f) = \int_a^b [f''(x)]^2\,dx \tag{3.8-15}$$

   Similarly, of all the functions $f(x)$ with continuous second derivatives
   which interpolate to these data and satisfy $f'(a) = y'_1$, $f'(b) = y'_n$, the spline
   $S(x)$ which also satisfies (3.8-13) uniquely minimizes (3.8-15).
2. If we define $h = \max_i h_i$ and let $S(x)$ be the natural spline interpolating
   $f(x)$ at $a_i$, $i = 1, \ldots, n$, where $f(x)$ has a continuous second derivative,
   then

$$\max_{a \le x \le b} |f(x) - S(x)| \le h\,[h\,I(f)]^{1/2} \tag{3.8-16}$$
$$\max_{a \le x \le b} |f'(x) - S'(x)| \le [h\,I(f)]^{1/2} \tag{3.8-17}$$

   A similar theorem holds for the spline $S(x)$ satisfying $S'(a) = f'(a)$,
   $S'(b) = f'(b)$.

A stronger, albeit only asymptotic, result states that if $f(x)$ has a continuous
fourth derivative, and if $\max_i (h/h_i) \le \beta < \infty$ as $h \to 0$ for a fixed $\beta$,
then

$$\max_{a \le x \le b} |f^{(k)}(x) - S^{(k)}(x)| = O(h^{4-k}) \qquad k = 0, 1, 2 \tag{3.8-18}$$


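Example 3.8 Determine the natural cubic spline $S(x)$ which interpolates to the values of $y_i$
at the points $a_i$, $i = 1, \ldots, 5$, where

    i       1        2        3        4        5

   a_i     .25      .30      .39      .45      .53
   y_i    .5000    .5477    .6245    .6708    .7280

We have that $h_2 = .05$, $h_3 = .09$, $h_4 = .06$, $h_5 = .08$, so that using (3.8-9) leads to

$$\lambda_2 = \tfrac{9}{14} \quad \mu_2 = \tfrac{5}{14} \qquad \lambda_3 = \tfrac{2}{5} \quad \mu_3 = \tfrac{3}{5} \qquad \lambda_4 = \tfrac{4}{7} \quad \mu_4 = \tfrac{3}{7}$$
$$\sigma_2 = .9540 \qquad \sigma_3 = .8533 \qquad \sigma_4 = .7717 \qquad \sigma_5 = .7150$$
$$d_2 = -4.3157 \qquad d_3 = -3.2640 \qquad d_4 = -2.4300$$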


Inserting these values into (3.8-10), we get 


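$$\tfrac{5}{14} M_1 + 2M_2 + \tfrac{9}{14} M_3 = -4.3157$$
$$\tfrac{3}{5} M_2 + 2M_3 + \tfrac{2}{5} M_4 = -3.2640$$
$$\tfrac{3}{7} M_3 + 2M_4 + \tfrac{4}{7} M_5 = -2.4300$$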


Since we wish to find a natural spline, we have that $M_1 = M_5 = 0$, so that we are left
with a tridiagonal system of three equations. Solving this system by the algorithm given in
Sec. 9.11 [cf. (9.11-8) to (9.11-10)], we find that

$$M_2 = -1.8806 \qquad M_3 = -.8626 \qquad M_4 = -1.0261$$

Inserting the values of the $M_i$'s, $h_i$'s, $x_i$'s, and $y_i$'s into (3.8-5), we get for $S(x)$ the
following representations:

$$S(x) = \begin{cases}
-6.2687(x - .25)^3 + 10(.30 - x) + 10.9697(x - .25) & x \in [.25, .30] \\
-3.4826(.39 - x)^3 - 1.5974(x - .30)^3 + 6.1138(.39 - x) + 6.9518(x - .30) & x \in [.30, .39] \\
-2.3961(.45 - x)^3 - 2.8503(x - .39)^3 + 10.4170(.45 - x) + 11.1903(x - .39) & x \in [.39, .45] \\
-2.1377(.53 - x)^3 + 8.3987(.53 - x) + 9.1000(x - .45) & x \in [.45, .53]
\end{cases}$$
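
The representation (3.8-5) can be evaluated directly once the $M_i$ are known. The following Python sketch is ours, not the book's, and rests on stated assumptions: the nodes are read off the subintervals listed above, the values $y_i$ are taken to be $a_i^{1/2}$ rounded to four decimals (as Prob. 33c indicates), and $M_3$ is taken as $-.8626$, the value implied by the printed coefficients $-1.5974$ and $-2.3961$. The name `spline_eval` is hypothetical.

```python
import bisect
import math

def spline_eval(a, y, m, x):
    """Evaluate (3.8-5) at x, given nodes a, values y and moments m (0-based lists)."""
    # Index i of the right-hand node of the subinterval [a[i-1], a[i]] that contains x
    i = min(max(bisect.bisect_left(a, x), 1), len(a) - 1)
    h = a[i] - a[i - 1]
    left, right = a[i] - x, x - a[i - 1]
    return (m[i - 1] * left ** 3 / (6.0 * h)
            + m[i] * right ** 3 / (6.0 * h)
            + (y[i - 1] - m[i - 1] * h * h / 6.0) * left / h
            + (y[i] - m[i] * h * h / 6.0) * right / h)

a = [0.25, 0.30, 0.39, 0.45, 0.53]            # nodes, read from the subintervals
y = [0.5000, 0.5477, 0.6245, 0.6708, 0.7280]  # assumed data: sqrt(a_i) to four decimals
m = [0.0, -1.8806, -0.8626, -1.0261, 0.0]     # natural spline: M_1 = M_5 = 0

for node, value in zip(a, y):
    print(node, spline_eval(a, y, m, node), value)   # the spline reproduces the data
s_mid = spline_eval(a, y, m, 0.35)
print(s_mid, math.sqrt(0.35))
```

At $x = .35$ the sketch gives a value near $.5917$, to be compared with $\sqrt{.35} = .5916$. Evaluating from (3.8-5) rather than from the four printed cubics avoids retyping rounded coefficients.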


3.9 OTHER METHODS OF INTERPOLATION; 
EXTRAPOLATION 


In interpolation, as in all branches of numerical analysis, there will be special 
cases in which methods superior to the general ones derived in this chapter 
can be derived and used without an unreasonable expenditure of effort. One 
example of this is the case of periodic functions in which methods based on 
Fourier-series approximations may be preferable to the polynomial approxi- 
mations of this chapter; for more on this, see Sec. 6.6-1. Another example is 
the use of rational functions instead of polynomials. While there are some 
theoretical and practical problems with rational interpolation {37}, which, of 
course, includes polynomial interpolation as a special case, it warrants more 
serious attention than is normally given to it. In fact, whenever the function 
to be interpolated is known to have a special functional character, an 
approximation based on this known functional character may be desirable 
{34}. 

Although we are restricting ourselves in this book to functions of a 
single variable, interpolation of functions of two or more variables can often 
be effected by a sequence of interpolations using the formulas of this chapter 
{36}. 

This chapter is entitled Interpolation, but it has been equally about 
extrapolation. Interpolation and extrapolation are, in fact, two aspects of the 
same type of procedure. Of the two, interpolation is much more common 
than extrapolation. The reason for this is straightforward and practical. We 
argued in Sec. 3.4 that $p_n(x)$ is minimized when $x$ is as nearly as possible
in the center of the interval spanned by the tabular points. Conversely, as
$x$ moves outside the interval spanned by the tabular points, as is the case in
extrapolation, the factors $x - a_i$ in $p_n(x)$ grow, and therefore, the error tends
to grow unless $x$ is very close to one of the endpoints of the interval $I$
spanned by the tabular points. Furthermore, from (3.2-9) we note that the
best a priori bound for $E(x)$ based on (3.2-9) is given by

$$|E(x)| \le \frac{M_n(x)}{n!}\,|p_n(x)| \tag{3.9-1}$$

where

$$M_n(x) = \max_{\xi \in I_x} |f^{(n)}(\xi)| \tag{3.9-2}$$

and $I_x$ is the interval spanned by $a_1, \ldots, a_n$ and $x$. Now as $x$ moves further
from $I$ so that the length of $I_x$ increases, $M_n(x)$ can only increase and this
also contributes to the growth of the error bound. Of course, for a particular
value of $x$, $f^{(n)}(\xi)$ and hence also $E(x)$ may be very small in magnitude, but
we cannot know a priori when this occurs and must rely only on the error
bound (3.9-1).

Thus extrapolation is inherently a more inaccurate process than interpo- 
lation and must always be used with extreme caution. When extrapolation 
must be used in some form (see, for example, Chap. 5), the value of x should 
be restricted to be as near the interval spanned by the tabular points as 
possible. 
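
The growth of the bound (3.9-1) outside the tabular interval is easy to see numerically. The sketch below, in Python, is only an illustration under our own assumptions: it reuses the tabular points of Example 3.7 with $f(x) = \ln x$, for which $|f^{(n)}(t)| = (n-1)!/t^n$ is largest at the left end of $I_x$, and the function names are hypothetical.

```python
import math

def node_polynomial(nodes, x):
    """p_n(x) = (x - a_1)(x - a_2)...(x - a_n)."""
    return math.prod(x - a for a in nodes)

def ln_error_bound(nodes, x):
    """Right-hand side of (3.9-1) for f(x) = ln x; nodes and x must be positive."""
    n = len(nodes)
    left_end = min(min(nodes), x)                 # left end of the interval I_x
    m_n = math.factorial(n - 1) / left_end ** n   # M_n(x) of (3.9-2) for ln x
    return m_n * abs(node_polynomial(nodes, x)) / math.factorial(n)

nodes = [0.40, 0.50, 0.70, 0.80]
points = (0.60, 0.85, 0.90, 1.00, 0.30)
bounds = {x: ln_error_bound(nodes, x) for x in points}
for x in points:
    print(x, abs(node_polynomial(nodes, x)), bounds[x])
```

The points $.30$ and $.90$ lie equally far outside the table and give the same $|p_n(x)|$, but the bound at $.30$ is larger because $M_n(x)$ has grown as well, which is the second effect described above.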


BIBLIOGRAPHIC NOTES 


Sections 3.1 to 3.4 The topics covered in these sections will be found in virtually any 
textbook on numerical analysis. In particular, excellent discussions of interpolation can be 
found in Hildebrand (1974), Kopal (1961), and Kuntzmann (1959). The orientation in these 
books as well as in most other numerical analysis texts is much more toward difference and 
divided-difference techniques {22} than in this book. An excellent though somewhat older 
reference to classical interpolation techniques is Steffensen (1950). A modern treatment of the 
subject is contained in Davis (1963). Hartree (1958) and Whittaker and Robinson (1948) 
contain a number of practical hints for special situations. 

The coefficients of both the Lagrangian and finite-difference interpolation formulas have 
been extensively tabulated. A bibliography of these tables will be found in Fletcher, Miller, 
Rosenhead, and Comrie (1962). A convenient collection of the more useful formulas is given by 
Davis and Polonsky (1964). 

The error term in the Lagrangian formula is discussed in a more general context in Milne 
(1949). The derivation of the error term used here can also be found in Scarborough (1962). For 
another approach, see Hildebrand (1974) or Kopal (1961). 

Our discussion of the lozenge diagram follows closely that of Hamming (1973); see also 
Kopal (1961). The use of difference methods in the construction of mathematical tables is 
considered by Fox (1957b); see also Fox (1957a).

The operational techniques introduced in Prob. 9 are further considered in the problems 
after the next chapter. A thorough discussion of these techniques will be found in Hildebrand 
(1974). The convergence of interpolation series {19} is considered by Erdos and Turan (1937) 
and by Davis (1963). 

Section 3.5 The basic references on iterated interpolation are the papers by Aitken (1932) 
and Neville (1934). 


Section 3.6 Inverse interpolation will be considered in much greater detail in Chap. 8; for 
tables of coefficients for particular cases of inverse interpolation using differences see Salzer 
(1943, 1944, 1945). 


Section 3.7 Hermite interpolation is discussed in many texts [e.g., Hildebrand (1974),
Kopal (1961)].


Section 3.8 The treatment of splines in this section follows that of Ahlberg, Nilson, and
Walsh (1967). The minimal property of natural cubic splines was discovered by Holladay 
(1957). The result (3.8-18) appears in Birkhoff and de Boor (1964). Further applications of 
splines in numerical analysis will be found in Greville (1967, 1969). 


Section 3.9 A detailed discussion of interpolation in several variables is given by
Steffensen (1950); see also Pearson (1920). The theory of rational interpolation is treated by Meinguet
(1970), while algorithms for implementing rational interpolation appear in Bulirsch and
Rutishauser (1968).


PROBLEMS


Section 3.2 


1 (a) If $n$ is the order of a Lagrangian interpolation formula, show that

$$\sum_{j=1}^{n} a_j^k\, l_j(x) = x^k \qquad k = 0, \ldots, n - 1$$

where the $a_j$ are the tabular points.

(b) For $n = 3$ and equally spaced tabular points, compute the maximum of $|l_j(x)|$ over the
interval spanned by the tabular points for $j = 1, 2, 3$.
Use Table 3.1 to estimate the bounds on $l_j(x)$ for $n = 5$. Use these results to make an inference
on the importance of roundoff error in interpolation using equally spaced data.


Section 3.3 


2 (a) Using equally spaced data and a three-point Lagrangian formula, find a bound on
$h^3 f'''(x)$ which, on the interval spanned by the three points, assures a truncation error of less
than $10^{-d}$, where $d$ is an integer.

(b) Similarly, find a bound on $h^5 f^{(5)}(x)$ when using a five-point Lagrangian formula.

(c) Use these results to estimate the maximum value of $h$, for both the three- and five-
point cases, that can be used to interpolate (i) $\sin x$ on $[-\pi, \pi]$, (ii) $e^x$ on $[-4, 4]$, and (iii)
$\sin 100x$ on $[-\pi, \pi]$, with a truncation error of less than $10^{-10}$.

3 (a) Show that $y(x)$ in the Lagrangian interpolation formula is the unique polynomial of
degree $n - 1$ passing through the points $[a_j, f(a_j)]$.

(b) Use the Lagrangian interpolation formula to find the cubic passing through the points 
(—3, —1), (0, 2), (3, —2), (6, 10). 

4 (a) Do the computation of Examples 3.1 and 3.2 with the same tabular points when 
f(x) = sin x. 

(b) Repeat part (a) using $\tan^{-1} x$.

*5 Consider the following table for the Bessel functions $J_p(x)$, $p = 0, 1, 2, 3, 4, 5$, correctly
rounded to four decimal places

    x      J_0(x)    J_1(x)    J_2(x)    J_3(x)    J_4(x)    J_5(x)

   2.0     .2239     .5767     .3528     .1289     .0340     .0070
   2.1     .1666     .5683     .3746     .1453     .0405     .0088
   2.2     .1104     .5560     .3951     .1623     .0476     .0109
   2.3     .0555     .5399     .4139     .1800     .0556     .0134
   2.4     .0025     .5202     .4310     .1981     .0643     .0162
   2.5    -.0484     .4971     .4461     .2166     .0738     .0195
   2.6    -.0968     .4708     .4590     .2353     .0840     .0232
   2.7    -.1424     .4416     .4696     .2540     .0950     .0274
   2.8    -.1850     .4097     .4777     .2727     .1067     .0321
   2.9    -.2243     .3754     .4832     .2911     .1190     .0373
   3.0    -.2601     .3391     .4861     .3091     .1320     .0430


(a) Suppose you wished to interpolate to find values of $J_0(x)$ at $x = 2.05 + .1j$, $j = 0, \ldots, 9$.
Use the relation

$$J'_p(x) = -J_{p+1}(x) + \frac{p}{x}\,J_p(x)$$


to find a bound on the truncation error in the worst case using (i) linear interpolation; (ii) a 
Lagrangian three-point formula. Which of these methods would you use if you wished to 
guarantee a total error in the result for every j of less than 5 x 10~* in magnitude? 

(b) Carry out the interpolation using this method. 

(c) Repeat parts (a) and (b) to find values of J,(x) at x = 2.05 + .1j,j = 1,..., 9. 

(d) How many correctly rounded decimal places for J (x) would have to be given for the 
use of a five-point Lagrangian formula to give significantly higher accuracy than the three-point 
formula? 

6 Use the data of Prob. 5 and a three-point Lagrangian formula to approximate (a)
$J_p(2.07)$, (b) $J_p(2.405)$, (c) $J_p(2.64)$, (d) $J_p(2.91)$, with $p = 0, 1, 2$.

7 Derive (3.3-8). 

8 Show that the first difference (forward, backward, or central) of a polynomial of degree
$n$ is a polynomial of degree $n - 1$. Thus deduce that the $n$th difference of a polynomial of degree
$n$ is a constant and the $(n + 1)$st is zero.

9 Difference operators. Define the shifting operator $E$ to be such that $Ef(x) = f(x + h)$.
Using this and the definitions of $\Delta$, $\nabla$, and $\delta$, establish the following identities: (a) $\Delta = E - 1$; (b)
$\nabla = 1 - E^{-1}$; (c) $\delta = E^{1/2} - E^{-1/2}$. Then use these relations to derive relations between $\Delta$ and
$\nabla$ and between $\Delta$ and $\delta$.


*10 (a) Using the rules of Sec. 3.3-2, show that any closed path of the form

$$\begin{array}{ccc} & \Delta^j f_{k-1} & \\ \Delta^{j-1} f_k & & \Delta^{j+1} f_{k-1} \\ & \Delta^j f_k & \end{array}$$

results in no contribution to any interpolation formula.

(b) Thus deduce that the path from $\Delta^{j-1} f_k$ to $\Delta^j f_{k-1}$ to $\Delta^{j+1} f_{k-1}$ results in the same
contribution as the path from $\Delta^{j-1} f_k$ to $\Delta^j f_k$ to $\Delta^{j+1} f_{k-1}$. Similarly, show that the path from
$\Delta^{j-1} f_k$ to $\Delta^j f_{k-1}$ to $\Delta^{j+1} f_{k-1}$ to $\Delta^j f_k$ and the path from $\Delta^{j-1} f_k$ to $\Delta^j f_k$ result in the same
contribution. From these results deduce that any closed path contributes nothing.

(c) Show also that the path from $f_{k+1}$ to $\Delta f_k$ to $f_k$ contributes nothing.

(d) Use the results of parts (a), (b), and (c) to deduce that all formulas which terminate on 
a given difference and start anywhere in the functional-value column are algebraically 
equivalent. 


*11 (a) Given a table of values at an interval h, discuss how you would generate a new 
table (“subtabulate”) at an interval ph, 0 < p <1, by using an appropriate interpolation 
formula. 

(b) Show that as $n \to \infty$, the left-hand side of (3.3-11) approaches $f(a_0 + hm)$ if the series
on the right-hand side converges (see Prob. 19).

(c) Let the forward-difference operator with respect to the interval $ph$ be represented by
$\Delta_p$. Using (3.3-8) and the result of part (b), show that

$$\Delta_p^j f_0 = \sum_{i=0}^{j} \sum_{k=0}^{\infty} (-1)^{j-i} \binom{j}{i} \binom{ip}{k} \Delta^k f_0 \qquad j = 1, 2, \ldots$$

(d) Use the results of Prob. 9 to show that in operational form

$$\Delta_p^j f_0 = [(1 + \Delta)^p - 1]^j f_0$$

Use this to calculate $\Delta_p^j$, $j = 1, 2, 3, 4$, in terms of $\Delta^k$ and $p$, retaining terms through $\Delta^4$.

(e) Use the results of part (c) to subtabulate the data of Prob. 5 for Jo(x) with (i) p = 4; 
(ii) o = +4. Compare the results for p = 4 with those of Prob. 5. How could you overcome the 
problems that arise near the end of the tabulation? 


12 (a) Derive the identities 


@ St r(7) Ro (7) =(n- 1) () 


(b) Use these results to show that )%., (— 1)~'(})(2) vanishes and thus deduce that the 
right-hand side of (3.3-12) is f;. 

13 (a) Use the results of Prob. 9 to express Newton's backward formula in terms of
backward differences at $a_0$.

(b) Similarly, express Gauss' forward and backward formulas in terms of central
differences at $a_0$.

(c) Using the notation $\mu\delta^{2k+1} f_0 = \tfrac{1}{2}(\delta^{2k+1} f_{1/2} + \delta^{2k+1} f_{-1/2})$, express Stirling's and
Bessel's formulas in terms of central differences.


14 (a) Show that in any finite-difference interpolation formula a difference of any order
can be eliminated by using the relation $\delta^n f_{k+1} - \delta^n f_k = \delta^{n+1} f_{k+1/2}$ or a similar relation for forward
and backward differences.

(b) Use this result and the result of Prob. 13b to eliminate the odd differences from Gauss'
forward formula and thus derive Everett's interpolation formula

$$y(m) = (1 - m) f_0 + m f_1 - \frac{m(m - 1)(m - 2)}{3!}\,\delta^2 f_0 + \frac{(m + 1)m(m - 1)}{3!}\,\delta^2 f_1 + \cdots$$

(This formula is useful in interpolating in tables which provide auxiliary tables of even central
differences.)

(c) Similarly, eliminate the even differences in Gauss' forward formula to get Steffensen's
interpolation formula

$$y(m) = f_0 + \frac{(m + 1)m}{2!}\,\delta f_{1/2} - \frac{m(m - 1)}{2!}\,\delta f_{-1/2} + \frac{(m + 2)(m + 1)m(m - 1)}{4!}\,\delta^3 f_{1/2} - \cdots$$

[Ref.: Hildebrand (1974), pp. 143-144, or Kopal (1961), pp. 50-54.]


*15 Throwback. (a) Use the result of Prob. 13c to show that the ratio of the coefficient $B_4$
of the fourth central difference in Bessel's formula to the coefficient $B_2$ of the second difference is
$(m + 1)(m - 2)/12$ and that for $0 \le m \le 1$ this ratio varies between $-\tfrac{1}{6}$ and $-\tfrac{3}{16}$.

(b) Because this ratio varies very little on this interval, consider replacing $B_4$ by $cB_2$.
Show that $B_4 - cB_2$ as a function of $m$ has a maximum independent of $c$ and two minima
dependent on $c$ on $[0, 1]$. Find the two values of $c$ which equalize the minimum and maximum
values of $B_4 - cB_2$ on this interval. Show that one of these values $c_1$ is very nearly equal to the
average value of $B_4/B_2$ over $[0, 1]$.

(c) Thus rewrite Bessel's formula as
$y(m) = \tfrac{1}{2}(f_0 + f_1) + (m - \tfrac{1}{2})\,\delta f_{1/2} + B_2[\delta^2 f_0 + \delta^2 f_1 + c_1(\delta^4 f_0 + \delta^4 f_1)]$.
This procedure is called throwback; i.e., we have thrown back the effect
of the fourth difference onto the second difference. [Ref.: Kopal (1961), pp. 54-56.]


16 (a) Display the error terms for the Newtonian and Gaussian interpolation formulas 
terminated with the difference of order k in terms of h and m. 

(b) Use these to derive the error terms for Stirling’s and Bessel’s formulas terminated with 
an odd or even difference. 


Section 3.4 


17 (a) What abscissas are involved in the calculation of each entry in the table in 
Example 3.4 for both the Gaussian and Newtonian formulas? 

(b) Verify that the actual error using six differences is consistent with that calculated using 
the result of Prob. 16a. 


18 (a) How many terms in Gauss’s forward formula can be used if x, is (i) the next to last 
entry in a table; (ii) the fourth entry? 

(b) Use (3.3-18) and (3.3-19) to show that when m is near zero, Stirling’s formula is a 
desirable one to use and that when m is near one-half, Bessel’s formula is desirable. 


*19 (a) If $h$ is fixed, show that in the limit as $n \to \infty$ Newton's forward formula, if it
converges, becomes with $a_0 = 0$

$$f(x) = f_0 + \sum_{k=1}^{\infty} \frac{\Delta^k f_0}{k!\,h^k} \prod_{j=1}^{k} [x - (j - 1)h]$$

(b) For $f(x) = e^{\alpha x}$ and $a_0 = 0$, show that $\Delta^k f_0 = (e^{\alpha h} - 1)^k$.

(c) By using the result of part (b) in part (a), show that the ratio of the $k + 1$ and $k$ terms of
the series is given by

$$\frac{e^{\alpha h} - 1}{h(k + 1)}\,(x - kh)$$

(d) By considering this ratio as $k \to \infty$, deduce that the series in part (a) converges if
$e^{\alpha h} < 2$ and diverges if $e^{\alpha h} > 2$ unless $x$ is a positive integral multiple of $h$, in which case it
converges. (A more difficult result is that for $e^{\alpha h} = 2$ the series converges if and only if $x > -h$.)

(e) Thus deduce that Newton's forward formula is an asymptotic series for $e^{\alpha x}$ when
$e^{\alpha h} > 2$. Contrast this with the convergence of the Taylor series for $e^{\alpha x}$ for all $\alpha x$. In practice, why
would we expect Newton's formula to be asymptotic even when $e^{\alpha h} < 2$? [Ref.: Hildebrand
(1974), pp. 154-156.]

20 Suppose you have a table of sin x at an interval h = .1. How many tabular points
would have to be used in interpolating in this table to assure a truncation error of less than (a)
$10^{-3}$; (b) 10~*; (c) 10° *; independent of $a_0$ and $m$?

21 Use the data of Prob. 5 and a finite-difference interpolation formula to approximate
(a) $J_p(2.07)$, (b) $J_p(2.405)$, (c) $J_p(2.64)$, (d) $J_p(2.91)$, with $p = 0, 1, 2$. In each case motivate your
choice of a particular interpolation formula and compare the results with those of Prob. 6.


*22 Divided differences. The divided difference of order $k > 1$ of $f(x)$ is defined by

$$f[a_1, \ldots, a_k] = \frac{f[a_2, \ldots, a_k] - f[a_1, \ldots, a_{k-1}]}{a_k - a_1} \qquad \text{with } f[a_1] = f(a_1)$$

(a) Prove that

$$f[a_1, \ldots, a_k] = \sum_{i=1}^{k} \frac{f(a_i)}{(a_i - a_1) \cdots (a_i - a_{i-1})(a_i - a_{i+1}) \cdots (a_i - a_k)}$$

and thus deduce that the order of the arguments in a divided difference is immaterial.

(b) Use the data of Example 3.1 to generate a divided-difference table analogous to the
difference table of Fig. 3.1.

(c) Show that

$$f[a_1, \ldots, a_{k-1}, x] = f[a_1, \ldots, a_k] + (x - a_k)\, f[a_1, \ldots, a_k, x]$$

and use this result to derive the formula

$$f(x) = f[a_1] + (x - a_1) f[a_1, a_2] + (x - a_1)(x - a_2) f[a_1, a_2, a_3] + \cdots + (x - a_1)(x - a_2) \cdots (x - a_{n-1}) f[a_1, \ldots, a_n] + E(x)$$

where

$$E(x) = p_n(x)\, f[a_1, \ldots, a_n, x]$$

This formula is called Newton's divided-difference interpolation formula.

(d) Deduce from part (c) that $(x - a_k)\, f[a_1, \ldots, a_k, x] \to 0$ as $x \to a_k$, $k = 1, \ldots, n$.


(e) Use this result to show that this formula must be algebraically equivalent to the
Lagrangian interpolation formula which uses the tabular points $a_1, \ldots, a_n$. Thus deduce that

$$f[a_1, \ldots, a_n, x] = \frac{1}{n!}\, f^{(n)}(\xi)$$

where $\xi$ is in the interval spanned by $a_1, \ldots, a_n$ and $x$.

(f) Use the results of parts (c) and (e) to show that when $a_i \to a$, $i = 1, \ldots, n$, the Newton
divided-difference formula and therefore the Lagrangian formula are both equivalent to a
Taylor series with remainder.

(g) Use Newton's divided-difference formula and the table of part (b) to approximate ln
.60. Compare the result with Example 3.1.

(h) When the tabular points are equally spaced, show that

$$f[a_1, \ldots, a_k] = \frac{1}{(k - 1)!\,h^{k-1}}\, \Delta^{k-1} f_1$$

and thus use part (c) to derive Newton's forward formula.


Section 3.5 


23 (a) Show that the table of (3.5-2) can be replaced by the symmetrical arrangement

$$\begin{array}{llllll}
a_1 & a_1 - x & y_1(x) & & & \\
 & & & y_{12}(x) & & \\
a_2 & a_2 - x & y_2(x) & & y_{123}(x) & \\
 & & & y_{23}(x) & & \\
\vdots & \vdots & \vdots & & & y_{1, 2, \ldots, n}(x) \\
 & & & & y_{n-2, n-1, n}(x) & \\
 & & & y_{n-1, n}(x) & & \\
a_n & a_n - x & y_n(x) & & &
\end{array}$$

What would the additional entries to the table be if the point $a_{n+1}$ were used?

(b) Use the technique of part (a) to do the computation of Example 3.5. [Ref.: Neville
(1934).]


24 Use iterated interpolation to do the calculations of parts (a) and (b) of Prob. 6.
Compare the results with those of Probs. 6 and 21.


25 Interpolation near a singularity. Suppose you are given a tabulation of sine, cosine,
and tangent as follows:

     x        sin x        cos x        tan x

   1.566    .9999885     .0047963     208.49128
   1.567    .9999928     .0037963     263.41125
   1.568    .9999961     .0027963     357.61106
   1.569    .9999984     .0017963     556.69098
   1.570    .9999997     .0007963    1255.76559


Using these data and any interpolation formula, calculate tan 1.5695 by (a) using the tan x
tabulation directly; (b) calculating sin 1.5695 and cos 1.5695. Discuss the reasons for the
varying errors in the two results. [Ref.: Kopal (1961), p. 84.]


Section 3.6 


26 Use the data of Prob. 5 and inverse interpolation to approximate the zero of $J_0(x)$
between 2 and 3 (cf. Prob. 6b).


27 Inverse interpolation near a singularity. Suppose we wished to calculate that value of $x$
for which $\sin x = .9999950$ using the data of Prob. 25. Why will the procedure of Sec. 3.6 not
work here? To solve this problem:

(a) Obtain an initial approximation $\bar{x}$ to $x$ by linear inverse interpolation between
$x_1 = 1.567$ and $x_2 = 1.568$.

(b) By direct interpolation, compute $\sin \bar{x}$.

If $\sin \bar{x} < .9999950$, replace $x_1$ by $\bar{x}$ and repeat this procedure. Otherwise, replace $x_2$ by $\bar{x}$.
Continue until the process converges. What condition must $\sin x$ satisfy on $[x_1, x_2]$ in order for
the process to converge?


Section 3.7 


28 Derive Eqs. (3.7-10). 


29 Use the data of Prob. 5 and Hermite interpolation to do the computation of parts (a) 
and (b) of Prob. 6 for p = 0. Compare with the previous results. Can the use of the Hermite 
interpolation formula be simplified for equally spaced data in a fashion analogous to that for 
the Lagrangian formula in Sec. 3.3-1? 


Section 3.8 


30 Determine the explicit form of $\lambda_2$, $d_2$, $\mu_{n-1}$, and $d_{n-1}$ in case 3 on p. 76 and show that
$|\lambda_2| < 2$, $|\mu_{n-1}| < 2$.

31 (a) Verify that if $S(x)$ is a cubic spline, then

$$\int_a^b [f''(x)]^2\,dx - \int_a^b [S''(x)]^2\,dx = \int_a^b [f''(x) - S''(x)]^2\,dx + 2\int_a^b S''(x)[f''(x) - S''(x)]\,dx$$

for any twice-differentiable function $f(x)$.

(b) Integrating by parts, show that if $f(x_i) = S(x_i)$, $i = 1, \ldots, n$, where the $x_i$ are the nodes
of $S(x)$, then

$$\int_a^b S''(x)[f''(x) - S''(x)]\,dx = S''(b)[f'(b) - S'(b)] - S''(a)[f'(a) - S'(a)]$$

(c) Hence prove that the natural spline uniquely minimizes (3.8-15) among all functions
with continuous second derivatives which interpolate the data at the points $x_i$, $i = 1, \ldots, n$.

(d) Similarly prove that the spline which satisfies (3.8-13) uniquely minimizes (3.8-15)
among all such functions which in addition satisfy $f'(a) = y'_1$, $f'(b) = y'_n$.


32 (a) Using Rolle's theorem, show that for every $x \in [a_i, a_{i+1}]$ there exists a point
$z \in [a_i, a_{i+1}]$ such that

$$f'(x) - S'(x) = \int_z^x [f''(t) - S''(t)]\,dt$$

(b) Using the Cauchy-Schwarz inequality, show that

$$|f'(x) - S'(x)| \le \left|\int_z^x [f''(t) - S''(t)]^2\,dt\right|^{1/2} |x - z|^{1/2}$$

(c) Prove (3.8-16) and (3.8-17).

33 (a) Determine the cubic spline $S(x)$ which interpolates to the data of Example 3.8 and
which also satisfies the conditions

$$S'(.25) = 1.0000 \qquad S'(.53) = .6868$$

(b) Determine the cubic spline $S(x)$ with nodes $a_2$, $a_3$, $a_4$ which interpolates to the values
of $y_2$, $y_3$, $y_4$ of Example 3.8 and which also satisfies the conditions $S(.25) = .5000$,
$S(.53) = .7280$.

(c) The values of $y_i$ given in Example 3.8 are equal to $a_i^{1/2}$, $i = 1, \ldots, 5$. Evaluate the
integral $\int_{.25}^{.53} S(x)\,dx$ for the three splines given in Example 3.8 and parts (a) and (b) and
compare the answers with the true value of $\int_{.25}^{.53} x^{1/2}\,dx = .1739$. Similarly, evaluate $S(.35)$ and
$S'(.35)$ and compare the answers with $\sqrt{.35} = .5916$ and $1/(2\sqrt{.35}) = .8452$, respectively.


Section 3.9 


34 Prony's method. Given $2n$ pairs $(x_i, y_i)$, $i = 0, \ldots, 2n - 1$, such that $x_i = x_0 + ih$, to
find $2n$ values $a_j$, $b_j$, $j = 1, \ldots, n$, such that $f(x_i) = y_i$, $i = 0, \ldots, 2n - 1$, where
$f(x) = \sum_{j=1}^{n} a_j e^{b_j x}$.

(a) Show that this problem is equivalent to that of finding $2n$ values $c_j$, $\mu_j$, $j = 1, \ldots, n$,
such that $g(i) = y_i$, $i = 0, \ldots, 2n - 1$, where $g(x) = \sum_{j=1}^{n} c_j (\mu_j)^x$.

(b) From the equations $g(i) = y_i$, $i = 0, \ldots, 2n - 1$, taken $n + 1$ at a time, derive a system
of linear equations for the coefficients $d_k$, $k = 0, \ldots, n - 1$, of the polynomial

$$p_n(x) = \sum_{k=0}^{n} d_k x^k \qquad d_n = 1$$

whose roots are $\mu_j$, $j = 1, \ldots, n$.

(c) Having found the $d_k$ and the $\mu_j$, find the $c_j$ by solving another system of $n$ linear
equations. Finally express the required $a_j$, $b_j$ in terms of the $c_j$, $\mu_j$, $j = 1, \ldots, n$.

(d) What happens if one of the roots of $p_n(x)$ is complex? [Ref.: Hildebrand (1974),
pp. 457-462.]


35 Apply Prony’s method to the following two sets of data 


36 Interpolation of functions of two variables. Suppose we are given a function of two
variables f(x, y) tabulated at points $(a_i, b_j)$, i = 1, ..., n, j = 1, ..., m.

(a) If we wish to approximate f(x, y) at a nontabular point, show that we can do this by
first interpolating to find $f(x, b_j)$ for a sequence of values of j and then using these values to
interpolate to find f(x, y) or vice versa.

(b) Given the table of values of the elliptic integral

$$ E(x, y) = \int_0^y (1 - \sin^2 x \, \sin^2 t)^{1/2}\,dt $$

  y \ x     50°       54°       58°       62°

  50°     0.8134    0.8060    0.7988    0.7920
  52°     0.8414    0.8332    0.8251    0.8174
  54°     0.8690    0.8598    0.8508    0.8422
  56°     0.8962    0.8859    0.8759    0.8663

find an approximation to E(55.4°, 53.1°) by (i) interpolating horizontally to find E(55.4°, y) for 
y = 50°, 52°, 54°, 56° and then interpolating vertically; (ii) interpolating vertically and then 
horizontally. If the desired point lies on a diagonal, for example, (52°, 51°), how could the 
interpolation procedure be simplified? [Ref.: Hildebrand (1974), p. 167.] 

37 Rational interpolation. Consider the problem of finding a rational function
$R_{mn}(x) = P_m(x)/Q_n(x)$, where $P_m(x) = \sum_{j=0}^{m} a_j x^j$ and $Q_n(x) = \sum_{j=0}^{n} b_j x^j$, such that $R_{mn}(x_j) = y_j$,
j = 1, ..., s.

(a) Show that the number of free parameters in $R_{mn}(x)$ is m + n + 1 so that we must have
s = m + n + 1.

(b) Show that if a rational interpolation function $R_{mn}(x)$ exists, it is unique.

(c) By considering the case $x_j$ = 0, 1, 2, $y_j$ = 1, 3, 3, j = 1, ..., 3, m = n = 1, show that a
rational interpolation function does not always exist.

(d) By considering the case $x_j$ = 0, 2, 3, $y_j$ = −1, 1, 4, j = 1, ..., 3, m = n = 1, show that
even when a rational interpolation function exists, it need not be continuous on the interval
spanned by the $x_j$. [Ref.: Bulirsch and Rutishauser (1968) p. 278.]
FOUR


NUMERICAL DIFFERENTIATION, 
NUMERICAL QUADRATURE, AND 
SUMMATION 


4.1 NUMERICAL DIFFERENTIATION OF DATA 


Whereas in interpolation we are only concerned with the case where the 
function is represented by a table of values, in numerical differentiation we 
may be given either a table of values or a closed expression for the function. 
In the latter case, the given form will usually be a single function statement, 
and it will normally be possible to differentiate the expression analytically 
using one of the many available formula-manipulation languages. Indeed, 
analytic differentiation has proved useful in various areas of numerical 
analysis and should be applied more frequently. When analytic differentia- 
tion can be performed on a computer, there is little more that need be said. 
Numerical differentiation of functions which cannot (reasonably) be differ- 
entiated analytically is considered in the next section. In this section we shall 
be concerned with the differentiation of numerical data. 

When the function is represented by a table of values, the most obvious 
approach is to differentiate the Lagrangian interpolation formula (3.1-1). This 
gives us 


$$ f^{(k)}(x) = \sum_{j=1}^{n} l_j^{(k)}(x) f(a_j) + \frac{d^k}{dx^k}\left[\frac{p_n(x)}{n!} f^{(n)}(\xi)\right] = y^{(k)}(x) + E^{(k)}(x) \tag{4.1-1} $$

In particular, for k = 1 we have

$$ f'(x) = \sum_{j=1}^{n} l_j'(x) f(a_j) + \frac{d}{dx}\left[\frac{p_n(x)}{n!} f^{(n)}(\xi)\right] \tag{4.1-2} $$

where the derivative of $l_j(x)$ is easily calculated using (3.2-4). Determination
of the error term in (4.1-1) or (4.1-2) presents a problem because ξ is an
unknown function of x (see Sec. 3.2). However, it can be shown that

$$ \frac{d^k}{dx^k}[E(x)] = \frac{f^{(n)}(\xi)}{(n-k)!} \prod_{j=1}^{n-k} (x - \eta_j) \tag{4.1-3} $$

where the n − k distinct points $\eta_j$ are independent of x and are known to lie
in the intervals

$$ a_j < \eta_j < a_{j+k} \qquad j = 1, \ldots, n-k $$

ξ depends on x and is in the smallest interval I containing x and the $\eta_j$.

If, in addition to function values at the data points, we are given values
of some of the derivatives at several points, we can derive the coefficients
$w_{ij}^{(k)}(x)$ in the formula

$$ f^{(k)}(x) \approx \sum_{j=1}^{n} \sum_{i=0}^{m_j} w_{ij}^{(k)}(x) f^{(i)}(a_j) \tag{4.1-4} $$

using the method of undetermined coefficients {1}.

An alternative approach to numerical differentiation of data for approximating
the first and second derivatives is to compute a cubic interpolating
spline (see Sec. 3.8) and differentiate it. For any particular value x, we determine
the two nodes $a_i$ and $a_{i+1}$ such that $a_i \le x \le a_{i+1}$. Then $y'(x) = S_i'(x)$
and $y''(x) = S_i''(x)$, where $S_i(x)$ is that piece of the spline which is a cubic
polynomial in $[a_i, a_{i+1}]$. By the result quoted in Sec. 3.8, we are assured
that, asymptotically, as h, the maximum distance between nodes, goes to
zero,

$$ \max_{a \le x \le b} |f^{(k)}(x) - S^{(k)}(x)| = O(h^{4-k}) \qquad k = 1, 2 $$

This indicates that the first and second derivatives of the interpolating spline
approximate the behavior of the corresponding derivatives of the given
function. By contrast, such a situation may not hold in the case of (4.1-1). We
can see this more clearly if we consider the case of equally spaced tabular
points. We then have that

$$ y^{(k)}(x) = \frac{1}{h^k} \sum_{j} l_j^{(k)}(m) f(a_j) \qquad x = a_0 + hm \tag{4.1-5} $$

where the differentiation of $l_j(m)$ is with respect to m and we have used the
fact that

$$ \frac{dy}{dx} = \frac{dy}{dm}\frac{dm}{dx} = \frac{1}{h}\frac{dy}{dm} \tag{4.1-6} $$
In a similar fashion, we can differentiate the interpolation formulas expressed
in difference form. For example, if we differentiate Newton's forward
formula (3.3-11), we get

$$ \frac{d^k}{dx^k} y(a_0 + hm) = \frac{1}{h^k}\frac{d^k}{dm^k} y(a_0 + hm) = \frac{1}{h^k}\frac{d^k}{dm^k}\sum_{j=0}^{n}\binom{m}{j}\Delta^j f_0 = \frac{1}{h^k}\sum_{j=k}^{n}\frac{d^k}{dm^k}\binom{m}{j}\Delta^j f_0 \tag{4.1-7} $$

The appearance of the factor $1/h^k$ in (4.1-5) makes explicit a fact which was
hidden in (4.1-1). In numerical differentiation formulas, we divide by small
quantities. This in itself would not be critical, but since our final result is not
large, the numerator must also be small. This can occur only if we subtract
quantities of the same order of magnitude. This causes a large relative error
in the result, especially if the original data are of low accuracy. In this
situation, a better approach to numerical differentiation may be to first
“smooth” the data and then to differentiate.


Example 4.1 Suppose we are given the empirical data (cf. Sec. 6.5)

    x      f(x)
   .1     5.1234
   .2     5.3057
   .3     5.5687
   .4     5.9378
   .5     6.4370
   .6     7.0978
   .7     7.9493
   .8     9.0253
   .9    10.3627

and we wish to find an approximation to the value of f’(.5). Since the data points are 
equally spaced, we can use formulas of type (4.1-5). Inasmuch as it is desirable to choose 
points symmetrically about the given evaluation point, we shall approximate f’(.5) by the 
three-point and five-point formulas 


$$ f_3'(x) = a_{-1} f(x - h) + a_0 f(x) + a_1 f(x + h) $$
$$ f_5'(x) = b_{-2} f(x - 2h) + b_{-1} f(x - h) + b_0 f(x) + b_1 f(x + h) + b_2 f(x + 2h) $$

By the method of undetermined coefficients {1} or directly from (4.1-5) we find first that
$-a_{-1} = a_1 = 1/(2h)$, $a_0 = 0$, so that [cf. (4.2-3)]

$$ f_3'(x) = \frac{f(x + h) - f(x - h)}{2h} $$

and second that $b_{-2} = -b_2 = 1/(12h)$, $-b_{-1} = b_1 = 8/(12h)$, $b_0 = 0$, so that

$$ f_5'(x) = \frac{1}{12h}[f(x - 2h) - 8f(x - h) + 8f(x + h) - f(x + 2h)] $$

Substituting the values in the table above, we find that

$$ f_3'(.5) = 5.8000 \qquad \text{and} \qquad f_5'(.5) = 5.7495 $$

The values given in the table are perturbations of the values of

$$ f(x) = x^4 + 3x^3 + 2x^2 + x + 5 $$

which can be thought to arise either from observational error or roundoff or both. The
exact value is f'(.5) = 5.7500, and we see that $f_5'(.5)$ is indeed almost exactly equal to f'(.5)
while $f_3'(.5)$ is not too bad an approximation. If we approximate f'(.5) using the least-squares
polynomial (cf. Sec. 6.5)

$$ y_4(x) = .9988x^4 + 2.9898x^3 + 2.0172x^2 + .9920x + 5.0010 $$

we find that $y_4'(.5) = 5.7510$, which is also very good.
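The two central formulas of this example can be sketched in a few lines of Python. The function names are illustrative, and the tabulated value at x = .6 is an assumption inferred from the quoted result $f_3'(.5) = 5.8000$; the other three values are those used elsewhere in the example.

```python
def three_point(f_minus, f_plus, h):
    """Three-point central estimate of f'(x) from f(x - h) and f(x + h)."""
    return (f_plus - f_minus) / (2.0 * h)


def five_point(f_m2, f_m1, f_p1, f_p2, h):
    """Five-point central estimate of f'(x) from f(x - 2h), ..., f(x + 2h)."""
    return (f_m2 - 8.0 * f_m1 + 8.0 * f_p1 - f_p2) / (12.0 * h)


def exact_f(x):
    """The unperturbed quartic underlying the data of Example 4.1."""
    return x**4 + 3 * x**3 + 2 * x**2 + x + 5


h = 0.1
# Perturbed data around x = .5; the entry at .6 is an inferred assumption.
data = {0.3: 5.5687, 0.4: 5.9378, 0.6: 7.0978, 0.7: 7.9493}

d3 = three_point(data[0.4], data[0.6], h)
d5 = five_point(data[0.3], data[0.4], data[0.6], data[0.7], h)

# With unperturbed values the five-point formula is exact for a quartic
# (up to floating-point roundoff).
d5_exact = five_point(exact_f(0.3), exact_f(0.4), exact_f(0.6), exact_f(0.7), h)

print(d3, d5, d5_exact)
```

Running the five-point formula on the unperturbed quartic shows that any remaining discrepancy in the data-based estimate comes from the perturbations alone.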

The small error in $f_5'(.5)$ was due to a fortunate cancellation of roundoff error. Since
f(x) is a fourth-degree polynomial, $f_5'(x)$ would be exact in the absence of roundoff; here
we have achieved such accuracy because of cancellation of roundoff since the true values
of f(.4) and f(.6) are respectively 5.9376 and 7.0976, so that

$$ f(.6) - f(.4) = f_6 - f_4 $$

That we are not always so fortunate can be seen from the following problem using
the same data: compute approximations to f'(.1) and f''(.1). In this case it is more convenient
to use differences based on formula (4.1-7) with m = 0. The formulas for the first and
second derivatives are then, up to fourth differences, as follows:

$$ h f_0' \approx \Delta f_0 - \tfrac{1}{2}\Delta^2 f_0 + \tfrac{1}{3}\Delta^3 f_0 - \tfrac{1}{4}\Delta^4 f_0 \tag{4.1-8} $$
$$ h^2 f_0'' \approx \Delta^2 f_0 - \Delta^3 f_0 + \tfrac{11}{12}\Delta^4 f_0 \tag{4.1-9} $$


We set up the following difference table: 


   f_i       Δf_i      Δ²f_i     Δ³f_i     Δ⁴f_i

  5.1234
             .1823
  5.3057               .0807
             .2630               .0254
  5.5687               .1061               −.0014
             .3691               .0240
  5.9378               .1301
             .4992
  6.4370


Using formula (4.1-8) starting with the approximation $h f_0' \approx \Delta f_0$ and adding an additional
difference each time, we find in succession the following approximations to f'(.1):


1.823, 1.420, 1.504, 1.507 


The exact value is 1.494, and the least-squares value is 1.489. For f''(.1) the values
found using (4.1-9) are in succession 


8.07, 5.53, 5.40 


The true value is 5.92, and the least-squares value is 5.95. 

The influence of errors in the data is evident here, since, in the absence of errors, the 
formula using fourth differences should give the true answer for a fourth-degree polynomial. 
Thus, the initial error in the data of less than .01 percent becomes magnified tenfold in the
first derivative and a hundredfold in the second in accordance with the general result that
the roundoff error in the kth derivative is of the order of $1/h^k$ times the roundoff error in
the data when the data are given at equally spaced points with spacing h.
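The successive approximations quoted above can be reproduced with a short script. This is only a sketch: the helper name `forward_differences` is hypothetical, and the coefficients are those of (4.1-8) and (4.1-9).

```python
def forward_differences(values):
    """Return [f_0, Δf_0, Δ²f_0, ...], the leading diagonal of the difference table."""
    leading = [values[0]]
    row = list(values)
    while len(row) > 1:
        row = [b - a for a, b in zip(row, row[1:])]
        leading.append(row[0])
    return leading


f = [5.1234, 5.3057, 5.5687, 5.9378, 6.4370]  # data at x = .1, .2, ..., .5
h = 0.1
d = forward_differences(f)

coeff_first = [1.0, -1.0 / 2.0, 1.0 / 3.0, -1.0 / 4.0]  # multiply Δ, Δ², Δ³, Δ⁴ in (4.1-8)
coeff_second = [1.0, -1.0, 11.0 / 12.0]                  # multiply Δ², Δ³, Δ⁴ in (4.1-9)

# Partial sums: add one more difference each time.
first, total = [], 0.0
for k, c in enumerate(coeff_first, start=1):
    total += c * d[k]
    first.append(total / h)

second, total = [], 0.0
for k, c in enumerate(coeff_second, start=2):
    total += c * d[k]
    second.append(total / h**2)

print([round(v, 3) for v in first])
print([round(v, 2) for v in second])
```

The partial sums do not settle on the exact values 1.494 and 5.92, which illustrates how the small perturbations in the data are amplified by the factors 1/h and 1/h².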
4.2 NUMERICAL DIFFERENTIATION OF
FUNCTIONS 


When we must compute the derivative of a function which can be evaluated 
anywhere in a given interval J, the following procedure is preferable to those 
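of the previous section. If x + h and x − h are in I, we begin with the Taylor
expansions

$$ f(x + h) = f(x) + h f'(x) + \frac{h^2}{2!} f''(x) + \frac{h^3}{3!} f'''(x) + \cdots $$
$$ f(x - h) = f(x) - h f'(x) + \frac{h^2}{2!} f''(x) - \frac{h^3}{3!} f'''(x) + \cdots $$

Subtracting, dividing by 2h, and rearranging terms, we find that

$$ f'(x) = \frac{f(x + h) - f(x - h)}{2h} - \sum_{i=1}^{\infty} \frac{f^{(2i+1)}(x)}{(2i+1)!} h^{2i} \tag{4.2-1} $$

Similarly, we find that

$$ f''(x) = \frac{f(x + h) - 2f(x) + f(x - h)}{h^2} - 2\sum_{i=1}^{\infty} \frac{f^{(2i+2)}(x)}{(2i+2)!} h^{2i} \tag{4.2-2} $$

Thus, we can expect that by taking h small enough we can approximate f'(x)
and f''(x), respectively, by

$$ f'(x) \approx \frac{f(x + h) - f(x - h)}{2h} \tag{4.2-3} $$
$$ f''(x) \approx \frac{f(x + h) - 2f(x) + f(x - h)}{h^2} \tag{4.2-4} $$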


Our strategy would then be to evaluate (4.2-3) [or (4.2-4)] for a sequence of 
values of h tending to zero, stopping when we get agreement to the desired 
accuracy. 

This procedure, however, is fraught with danger since eventually round- 
off will dominate the calculation. As h tends to zero, both f(x + h) and 
f(x — h) tend to f(x), so that their difference tends to the difference of two 
almost equal quantities and thus contains fewer and fewer significant digits. 
Thus, it is meaningless to carry out this computation beyond a certain 
threshold value of h which is dependent on the accuracy with which f(x) can 
be computed. If a bound on the relative error in the computation of f(x) is
given by R, that is, the computed value $\bar{f}(x) = f(x)(1 + r)$ with $|r| \le R$,
then a bound on the relative error in (4.2-3) is given by R/h. If we wish a
relative error of ε or less in the first derivative, we cannot allow h to go below
R/ε. For the second derivative the threshold is $h^2 \ge 4R/\varepsilon$.
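The existence of such a threshold is easy to observe experimentally. The following sketch applies (4.2-3) to sin x at x = 1 in double-precision arithmetic for h = 10⁻¹, ..., 10⁻¹³. The test function and the range of h are illustrative assumptions, not taken from the text.

```python
import math


def central_difference(f, x, h):
    """Approximation (4.2-3) to f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


x = 1.0
exact = math.cos(x)  # derivative of sin at x

# errors[k] is the absolute error obtained with h = 10**(-k)
errors = {}
for k in range(1, 14):
    h = 10.0 ** (-k)
    errors[k] = abs(central_difference(math.sin, x, h) - exact)

best_k = min(errors, key=errors.get)
for k in sorted(errors):
    print(f"h = 1e-{k:02d}   error = {errors[k]:.3e}")
print("smallest error at h = 1e-%02d" % best_k)
```

The error first decreases roughly like h², in agreement with (4.2-1), and then grows again once the subtraction of nearly equal function values dominates.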

If while computing (4.2-3) or (4.2-4) with successively smaller values of h 
we find that we reach the threshold value of h before we have achieved
agreement to the desired accuracy between several successive values, we can 
resort to the technique of Richardson extrapolation, based on expansions 
(4.2-1) and (4.2-2). This is a general technique which we shall now explain. 

In the general case of Richardson extrapolation, we have a procedure for
approximating a functional $\phi = \phi(f)$ of a given function f by a sequence,
depending on a step length h, in which as h → 0 the error has the asymptotic
form

$$ E = \sum_{j=1}^{\infty} a_j h^{\gamma_j} \qquad \gamma_1 < \gamma_2 < \cdots \tag{4.2-5} $$

where the $a_j$ are constants which may depend on the function f but not on h.
The $a_j$ are unknown to us, but we assume knowledge of the $\gamma_j$. If the
computation is carried out for two different values of h, $h_1$ and $h_2$, $h_2 < h_1$,
resulting in two approximations $\phi_1$ and $\phi_2$, then the true value φ is given by

$$ \phi = \phi_i + \sum_{j=1}^{\infty} a_j h_i^{\gamma_j} \qquad i = 1, 2 \tag{4.2-6} $$

Multiplying (4.2-6) with i = 1 by $h_2^{\gamma_1}$ and (4.2-6) with i = 2 by $h_1^{\gamma_1}$, subtracting,
and solving for φ, we get

$$ \phi = \frac{h_1^{\gamma_1}\phi_2 - h_2^{\gamma_1}\phi_1}{h_1^{\gamma_1} - h_2^{\gamma_1}} + \sum_{j=2}^{\infty} a_j \frac{h_1^{\gamma_1} h_2^{\gamma_j} - h_2^{\gamma_1} h_1^{\gamma_j}}{h_1^{\gamma_1} - h_2^{\gamma_1}} \tag{4.2-7} $$

When we set $h_2 = \rho h_1$, ρ < 1, (4.2-7) becomes

$$ \phi = \frac{\phi_2 - \rho^{\gamma_1}\phi_1}{1 - \rho^{\gamma_1}} + \sum_{j=2}^{\infty} a_j \frac{\rho^{\gamma_j} - \rho^{\gamma_1}}{1 - \rho^{\gamma_1}} h_1^{\gamma_j} = \phi_{12} + \sum_{j=2}^{\infty} b_j h_1^{\gamma_j} \tag{4.2-8} $$

which is of the form (4.2-6). If we wish to eliminate the term with j = 2 in
(4.2-8), we must compute $\phi_3$ with $h_3 = \rho h_2$ and compute $\phi_{23}$ in a similar
fashion and then compute $\phi_{123}$ from $\phi_{12}$ and $\phi_{23}$.

Let us formalize this procedure. If we denote $\phi_i$ by $T_0^i$, we can generate a
triangular array of approximants $T_m^i$ by the formula

$$ T_m^i = \frac{T_{m-1}^{i+1} - \rho^{\gamma_m} T_{m-1}^{i}}{1 - \rho^{\gamma_m}} \qquad i = 1, 2, \ldots \tag{4.2-9} $$

where the element $T_{m-1}^1$ corresponds to $\phi_{12 \cdots m}$.
In some cases, a geometric growth of 1/h by a factor 1/ρ is not desirable or
possible, and we are interested in a general sequence $\{h_i\}$. In this case, we can
generate a table $T_m^i$ by a simple algorithm only if the $\gamma_j$ have a special
structure, $\gamma_j = j\gamma + \delta$. For the usual case δ = 0, we have that {6}

$$ T_m^i = T_{m-1}^{i+1} + \frac{T_{m-1}^{i+1} - T_{m-1}^{i}}{(h_i/h_{i+m})^{\gamma} - 1} \tag{4.2-10} $$

The general case δ ≠ 0 is considered in a problem {7}. When both
$h_i = \rho^{i-1} h_1$ and $\gamma_j = j\gamma$, we have a special case of both general cases, and we
see that both (4.2-9) and (4.2-10) reduce to the same formula

$$ T_m^i = T_{m-1}^{i+1} + \frac{T_{m-1}^{i+1} - T_{m-1}^{i}}{\rho^{-m\gamma} - 1} \tag{4.2-11} $$


Example 4.2 Compute the first derivative of f(x) = −cot (x) for x = 0.04. Take
$h_1 = .0128$ and ρ = 1/2. We get the following table using (4.2-11) with γ = 2 from (4.2-1),
where $T_0(h) = [f(x + h) - f(x - h)]/2h$.

    h          T_0            T_1            T_2            T_3
  0.0128   696.6346914
                          623.4601726
  0.0064   641.7538023                   625.3455055
                          625.2276722                   625.3334226
  0.0032   629.3592047                   625.3336114
                          625.3269902                   625.3334448
  0.0016   626.3350438                   625.3334474
                          625.3330438                   625.3334257
  0.0008   625.5835438                   625.3334260
                          625.3334021
  0.0004   625.3959375

The oscillating behavior of the values in the $T_3$ column precludes continuing, but we have
achieved agreement to seven figures with the exact value 625.33344002.

If we allow ourselves more freedom in the choice of the h;, we get the following result, 
using (4.2-10): 


    h           T_0            T_1            T_2            T_3            T_4            T_5            T_6
  0.0256   1058.9377906
                           495.5225783
  0.0192    812.4436352                   640.1425224
                           603.9875364                   624.4282997
  0.0128    696.6346914                   626.6381123                   625.3572216
                           620.9754683                   625.2991640                   625.3330932
  0.0096    663.5337813                   625.4479360                   625.3339415                   625.3334398
                           624.3298191                   625.3317679                   625.3334344
  0.0064    641.7538023                   625.3481040                   625.3334485
                           625.0935328                   625.3333435
  0.0048    634.4649344                   625.3349836
                           625.2746209
  0.0032    629.3592047

Here we have achieved nine-figure accuracy even though our first two approximations in
the $T_1$ column are way off.
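The second table of Example 4.2 can be regenerated with a short routine built on (4.2-10) with γ = 2. This is a sketch only: the name `richardson_table` is hypothetical, rows are indexed from zero, and double-precision results may differ from the printed table in the last digits.

```python
import math


def richardson_table(T0, hs, gamma=2.0):
    """Triangular array from (4.2-10); T[m][i] plays the role of T_m^{i+1}."""
    T = [list(T0)]
    for m in range(1, len(hs)):
        prev = T[-1]
        row = []
        for i in range(len(prev) - 1):
            ratio = (hs[i] / hs[i + m]) ** gamma
            row.append(prev[i + 1] + (prev[i + 1] - prev[i]) / (ratio - 1.0))
        T.append(row)
    return T


def f(x):
    return -1.0 / math.tan(x)  # f(x) = -cot x


def central(x, h):
    """T_0(h) = [f(x + h) - f(x - h)] / 2h."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


x = 0.04
hs = [0.0256, 0.0192, 0.0128, 0.0096, 0.0064, 0.0048, 0.0032]
T = richardson_table([central(x, h) for h in hs], hs)

best = T[-1][0]                   # the single entry of the T_6 column
exact = 1.0 / math.sin(x) ** 2    # d/dx(-cot x) = csc^2 x
print(best, exact)
```

Passing a geometric sequence of steps with ratio 1/2 to the same routine reproduces the first table, since (4.2-10) then reduces to (4.2-11).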
4.3 NUMERICAL QUADRATURE: THE GENERAL
PROBLEM 


We now consider the problem of numerical quadrature, namely the approxi- 
mation of the linear functional 


$$ I(f) = \int_a^b f(x)\,dx \qquad -\infty \le a < b \le \infty \tag{4.3-1} $$

Since I(f) is linear, it makes sense to approximate it by the linear functional

$$ Q(f) = \sum_{j=1}^{n} \sum_{i=0}^{m} A_{ij} f^{(i)}(a_{ij}) \tag{4.3-2} $$

so that we have the quadrature equation†

$$ I(f) = Q(f) + E \tag{4.3-3} $$

or

$$ \int_a^b f(x)\,dx = \sum_{j=1}^{n} \sum_{i=0}^{m} A_{ij} f^{(i)}(a_{ij}) + E \tag{4.3-4} $$

Setting E = 0 in (4.3-4) then gives us an approximation to the definite integral
of f(x) as a linear combination of values of f(x) and its derivatives. The
numerical-quadrature problem is to specify the $A_{ij}$'s and $a_{ij}$'s so that this
approximation has desirable properties, i.e., achieves some desired accuracy.

Once again our approach will be that of exact polynomial approximation.
That is, we shall attempt to choose the $A_{ij}$'s and $a_{ij}$'s so that E in (4.3-4)
is zero when f(x) is a polynomial of sufficiently low degree. We shall again
restrict ourselves mainly to the case m = 0; that is, we shall try to express the
integral as a linear combination of functional values alone, as is done, for
example, in the trapezoidal rule. This is by far the most important case both
theoretically and practically. With the restriction m = 0, we can rewrite
(4.3-4) after some obvious changes in notation as

$$ \int_a^b f(x)\,dx = \sum_{j=1}^{n} H_j f(a_j) + E \tag{4.3-5} $$


One equation of the form (4.3-5) can clearly be derived by integrating 
the Lagrangian interpolation formula (3.1-1). Without considering the de- 
tails of this now, we can nevertheless see that since the Lagrangian formula is 
exact for polynomials of degree n— 1 or less, then so will the formula 
resulting from its integration. This suggests the question: With no a priori 
restrictions on the “abscissas” $a_j$ (such as that they be equally spaced) and
the “weights” $H_j$, what is the highest-degree polynomial for which E in
(4.3-5) can be made zero? We call the degree of this polynomial the order of
accuracy of the formula. Since we have 2n constants at our disposal (n $a_j$'s
and n $H_j$'s), we suspect that the answer is a polynomial of degree 2n − 1. In
the next section, we shall show that this is indeed the case.

† Here and in the remainder of this chapter, we shall generally denote the error by E instead
of E(x) because the variable x will not appear explicitly in the error term as it has previously.

We shall not explicitly consider the problem of evaluating the indefinite 
integral 


$$ y(x) = \int_{x_0}^{x} f(t)\,dt \tag{4.3-6} $$

in this chapter. This problem is equivalent to solving the differential
equation

$$ \frac{dy}{dx} = f(x) \qquad y(x_0) = 0 \tag{4.3-7} $$

and as such can be solved by the techniques of Chap. 5. For any specific
value of x the methods of this chapter can, of course, be used to evaluate y(x).

Another approach to indefinite integration which can also be used for
definite integrals is to determine a linear approximation to f(t)

$$ f(t) \approx \sum_{i} a_i \phi_i(t) \tag{4.3-8} $$

where the indefinite integrals $\psi_i(x)$

$$ \psi_i(x) = \int_{x_0}^{x} \phi_i(t)\,dt \tag{4.3-9} $$

are available analytically. Then

$$ \int_{x_0}^{x} f(t)\,dt \approx \sum_{i} a_i \psi_i(x) \tag{4.3-10} $$

gives a series approximation of the indefinite integral. Particular choices of
$\phi_i(x)$ in the normalized case where $x_0 = -1$ and $-1 \le x \le 1$ are the Chebyshev
polynomials of the first kind $T_n(x)$, the Chebyshev polynomials of the
second kind $S_n(x)$, and the Legendre polynomials $P_n(x)$, described later in
this chapter.


4.3-1 Numerical Integration of Data


In the succeeding sections, we shall be concerned with the numerical quadrature
of functions that can be evaluated at any given point or that are tabulated
at equally spaced points. A different problem is that of the integration
of data. Here, we are given a set of pairs $(a_i, f_i)$, i = 1, ..., n, and we wish to
find an approximation to the integral of the function f(x) which is represented
more or less accurately by the given values $f_i$. We mention three
approaches to this problem. The first is to compute an interpolating cubic
spline S(x), as in Sec. 3.8, and integrate S(x) exactly. This yields the
indefinite integral and hence also the definite integral between any two
points in the integration interval. The resulting approximation is quite good,
and, asymptotically, as h, the maximum spacing between the nodes, goes to
zero, the error in the integral is of the order of $h^4$. A second approach which
also yields the indefinite integral is to approximate f(x) by a least-squares
polynomial $p_m(x)$ (Chap. 6) and approximate the integral of f(x) by the
integral of $p_m(x)$. This is especially to be recommended when the $f_i$ are
empirical data subject to experimental error.

The third approach is to use piecewise polynomial interpolation using
cubic polynomials. If we define $p_i(x)$ to be the cubic interpolating to f(x) at
the four points $a_{i-1}$, $a_i$, $a_{i+1}$, $a_{i+2}$, then the approximation to the integral of
f(x) takes the form


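$$ \int_{a_1}^{a_n} f(x)\,dx \approx \sum_{i=1}^{n-1} \int_{a_i}^{a_{i+1}} \bar{p}_i(x)\,dx $$

where $\bar{p}_i(x) = p_i(x)$, 2 ≤ i ≤ n − 2, $\bar{p}_1(x) = p_2(x)$, and $\bar{p}_{n-1}(x) = p_{n-2}(x)$.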


4.4 GAUSSIAN QUADRATURE 


For now let us assume that a and b in (4.3-5) are finite. Then, if (4.3-5) is to
be exact for polynomials of degree 2n − 1 or less, we can get a set of 2n
equations for the 2n unknown constants by substituting $f(x) = x^k$, k = 0, 1,
..., 2n − 1, into (4.3-5) and setting E = 0. We get

$$ \alpha_k = \sum_{j=1}^{n} H_j a_j^k \qquad k = 0, \ldots, 2n-1 \tag{4.4-1} $$

where

$$ \alpha_k = \int_a^b x^k\,dx = \frac{b^{k+1} - a^{k+1}}{k+1} \tag{4.4-2} $$


These nonlinear equations, if we can solve them and if the solution is real, 
will give us the abscissas and weights we desire. This algebraic approach to 
our problem is considered further in {10}, but we abandon it here in favor of 
an analytic approach which (1) will tell us without actually calculating the 
weights and abscissas whether or not they are real; (2) will enable us to 
determine E when f(x) is not a polynomial of degree 2n — 1 or less; and (3) 
will enable us to show that the abscissas are the zeros of well-known polyno- 
mials. As we shall see, once the abscissas are known, the weights are easily 
calculable. 
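Although the algebraic approach is not pursued here, the equations (4.4-1) are easy to check numerically for a given set of abscissas and weights. The following sketch does so for the two-point rule on [−1, 1] with abscissas ±1/√3 and unit weights; the helper names are illustrative.

```python
import math


def moment(k, a=-1.0, b=1.0):
    """alpha_k of (4.4-2): the integral of x**k over [a, b]."""
    return (b ** (k + 1) - a ** (k + 1)) / (k + 1)


def moment_residuals(abscissas, weights, a=-1.0, b=1.0):
    """Left side minus right side of (4.4-1) for k = 0, ..., 2n - 1."""
    n = len(abscissas)
    return [
        moment(k, a, b) - sum(H * x**k for H, x in zip(weights, abscissas))
        for k in range(2 * n)
    ]


nodes = [-1.0 / math.sqrt(3.0), 1.0 / math.sqrt(3.0)]
weights = [1.0, 1.0]

residuals = moment_residuals(nodes, weights)

# The next power, k = 2n = 4, is no longer integrated exactly.
gap = moment(4) - sum(H * x**4 for H, x in zip(weights, nodes))

print(residuals, gap)
```

All four residuals vanish to rounding error, while the rule misses the integral of x⁴ by 2/5 − 2/9 = 8/45, so the order of accuracy is exactly 2n − 1 = 3.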

The starting point of our analytical approach is the Hermite interpola- 
tion formula (3.7-17) 


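$$ f(x) = \sum_{j=1}^{n} h_j(x) f(a_j) + \sum_{j=1}^{n} \bar{h}_j(x) f'(a_j) + \frac{p_n^2(x)}{(2n)!} f^{(2n)}(\xi) \tag{4.4-3} $$

which is exact for polynomials of degree 2n − 1 or less. Integrating (4.4-3)
between a and b, we get

$$ \int_a^b f(x)\,dx = \sum_{j=1}^{n} H_j f(a_j) + \sum_{j=1}^{n} \bar{H}_j f'(a_j) + E \tag{4.4-4} $$

where

$$ H_j = \int_a^b h_j(x)\,dx \qquad \bar{H}_j = \int_a^b \bar{h}_j(x)\,dx \tag{4.4-5} $$

and

$$ E = \frac{1}{(2n)!} \int_a^b p_n^2(x) f^{(2n)}(\xi)\,dx \tag{4.4-6} $$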

Since E is zero if f(x) is a polynomial of degree 2n − 1 or less, if we can
choose the abscissas so that $\bar{H}_j = 0$, j = 1, ..., n, then (4.4-4) will have the
form (4.3-5) with the desired properties. Thus $\bar{H}_j = 0$, j = 1, ..., n, is a
sufficient condition to achieve our desired accuracy of order 2n − 1. It is also
necessary. To see this let $f(x) = \bar{h}_j(x)$ in (4.3-5). Since $\bar{h}_j(x)$ is a polynomial of
degree 2n − 1, E = 0. From (4.4-5) it follows that the left-hand side of (4.3-5)
is $\bar{H}_j$, and from (3.7-4) it follows that the right-hand side is zero. Therefore,
since we have put no restriction on the $a_j$'s in (4.4-4), if we cannot find
abscissas for which the $\bar{H}_j$'s are zero, then no formula of the type (4.3-5) with
order of accuracy 2n − 1 is possible.
Using (3.7-18) and (3.2-4), we have

$$ \bar{H}_j = \int_a^b (x - a_j) l_j^2(x)\,dx = \int_a^b p_n(x) \frac{l_j(x)}{p_n'(a_j)}\,dx \tag{4.4-7} $$

Since $p_n(x)$ is a polynomial of degree n and $l_j(x)$ is a polynomial of degree
n − 1, a sufficient condition for $\bar{H}_j = 0$, j = 1, ..., n, is for $p_n(x)$ to be orthogonal
to all polynomials of degree n − 1 or less over [a, b]. This condition is
also necessary; we leave the proof to a problem {11}. Without loss of generality,
we may assume [a, b] = [−1, 1],† in which case the orthogonal polynomial
$p_n(x)$ is a multiple of $P_n(x)$, the Legendre polynomial of degree n. Since
by definition $p_n(x)$ has leading coefficient 1, we have, using the standard
definition of the Legendre polynomials {19},

$$ p_n(x) = \frac{2^n (n!)^2}{(2n)!} P_n(x) \tag{4.4-8} $$


In the next section we shall prove that the zeros of the Legendre polynomial 
of any degree are real, so that this settles the question of the existence of real 
abscissas. The zeros of the Legendre polynomials have been tabulated for all 
values of n of practical interest; a short table of these zeros is given in 
Table 4.1. 


† By the change of variable y = [1/(b − a)](2x − a − b), the interval [a, b] in x is replaced by
the interval [−1, 1] in y.

Table 4.1 Zeros of Legendre polynomials and corresponding weights

  n    Abscissas a_j              Weights H_j

  2    ±0.577350 = ±1/√3          1
  3    0                          8/9
       ±0.774597                  5/9
  4    ±0.339981                  0.652145
       ±0.861136                  0.347855
  5    0                          0.568889
       ±0.538469                  0.478629
       ±0.906180                  0.236927


To find the weights we again use Eqs. (3.7-18) and (3.2-4) to get

$$ H_j = \int_{-1}^{1} h_j(x)\,dx = \int_{-1}^{1} [1 - 2 l_j'(a_j)(x - a_j)] l_j^2(x)\,dx $$
$$ = \int_{-1}^{1} l_j^2(x)\,dx - 2 l_j'(a_j) \int_{-1}^{1} (x - a_j) l_j^2(x)\,dx $$
$$ = \int_{-1}^{1} l_j^2(x)\,dx \tag{4.4-9} $$

since the second integral, which by definition is $\bar{H}_j$, is zero. From (4.4-9) it is
obvious that the weights are all positive; see Table 4.1. A simpler expression
for the weights can be found by considering (4.3-5) with $f(x) = l_j(x)$ and
using the weights and abscissas found above. Since $l_j(x)$ is a polynomial of
degree n − 1, E is zero and we have

$$ \int_{-1}^{1} l_j(x)\,dx = \sum_{i=1}^{n} H_i l_j(a_i) = H_j \tag{4.4-10} $$

since $l_j(a_i) = \delta_{ij}$. Together (4.4-9) and (4.4-10) imply

$$ \int_{-1}^{1} l_j^2(x)\,dx = \int_{-1}^{1} l_j(x)\,dx \tag{4.4-11} $$

The error term as given by (4.4-6) can be simplified using the mean-value
theorem for integrals since $p_n^2(x)$ is never negative. Using [−1, 1] in
place of [a, b], we have

$$ E = \frac{f^{(2n)}(\eta)}{(2n)!} \int_{-1}^{1} p_n^2(x)\,dx \tag{4.4-12} $$

where η lies in (−1, 1).
Any quadrature formula whose abscissas and weights are subject to no
constraints and which are determined so as to achieve a maximum order of
accuracy is called a Gaussian quadrature formula. In particular, (4.3-5) with
the abscissas given as the zeros of the Legendre polynomial of degree n and
the weights by (4.4-10) is called a Gauss-Legendre quadrature formula, or
simply a Gauss formula.


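Example 4.3 Evaluate

$$ \int_1^3 \frac{dx}{x} $$

using Gauss-Legendre quadrature with n = 3.
With the change of variable y = x − 2, the integral becomes

$$ \int_{-1}^{1} \frac{dy}{y + 2} $$

Using Table 4.1 with n = 3, we have

$$ \int_{-1}^{1} \frac{dy}{y + 2} \approx \frac{5}{9}\,\frac{1}{1.225403} + \frac{8}{9}\,\frac{1}{2} + \frac{5}{9}\,\frac{1}{2.774597} = 1.098039 $$

whereas the true value of the integral is ln 3 = 1.098612. From Eq. (4.4-12) we have

$$ E = \frac{f^{(6)}(\eta)}{6!} \int_{-1}^{1} p_3^2(x)\,dx $$

Using (4.4-8) and anticipating some results from Sec. 4.6 [see (4.6-10)], we have

$$ \int_{-1}^{1} p_3^2(x)\,dx = \frac{4}{25} \int_{-1}^{1} P_3^2(x)\,dx = \frac{8}{175} $$

so that

$$ E = \frac{1}{15{,}750}\,\frac{6!}{(\eta + 2)^7} = \frac{8}{175}\,\frac{1}{(\eta + 2)^7} $$

Thus

$$ .000021 \approx \frac{8}{175}\,\frac{1}{3^7} < E < \frac{8}{175} \approx .045714 $$

and the actual error is indeed within these bounds.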
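The arithmetic of Example 4.3 can be checked with a few lines of Python. The sketch below uses the n = 3 row of Table 4.1, with ±0.774597 written as ±√(3/5), and maps [a, b] onto [−1, 1] by the linear change of variable; the function name is illustrative.

```python
import math


def gauss_legendre_3(f, a, b):
    """Three-point Gauss-Legendre rule on [a, b] (n = 3 row of Table 4.1)."""
    nodes = (-math.sqrt(0.6), 0.0, math.sqrt(0.6))   # +-0.774597 and 0
    weights = (5.0 / 9.0, 8.0 / 9.0, 5.0 / 9.0)
    half = 0.5 * (b - a)
    mid = 0.5 * (a + b)
    return half * sum(H * f(mid + half * y) for H, y in zip(weights, nodes))


approx = gauss_legendre_3(lambda x: 1.0 / x, 1.0, 3.0)
error = math.log(3.0) - approx

# Bounds on E from the example: (8/175) / 3**7 < E < 8/175
lower = 8.0 / 175.0 / 3.0**7
upper = 8.0 / 175.0

print(approx, error, lower, upper)
```

The computed error of about .00057 lies between the two bounds, much closer to the lower one.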


Gaussian quadrature formulas were seldom used in practice before the 
advent of digital computers. This was because the use of “simple” numbers, 
i.e., integers and rational numbers, is much more convenient on desk calculators
than the nonsimple numbers, which generally must be used in Gaussian
quadrature calculations (cf. Table 4.1). For example, when using desk
calculators, functions are often evaluated by table lookup, in which case 
simple values of the abscissas may mean that no interpolation is required. 
But on digital computers, functions are almost always evaluated using a 
rational or polynomial approximation of the type to be discussed in Chap. 7. 
In this case whether the numbers are simple or not generally makes no 
difference. On digital computers, therefore, the Gaussian quadrature for- 
mulas discussed in this section and the next four are practical for certain 
problems. In Sec. 4.11 we shall discuss the general problem of choosing a 
quadrature formula for a specific application. 
4.5 WEIGHT FUNCTIONS


In this section we shall generalize the ideas of the previous section by considering
in place of (4.3-5)

$$ \int_a^b w(x) f(x)\,dx = \sum_{j=1}^{n} H_j f(a_j) + E \tag{4.5-1} $$


where w(x), the weight function, does not appear on the right-hand side of 
(4.5-1). That it is not artificial to separate the integrand into two functions 
w(x) and f(x) is borne out by (1) the not uncommon need to evaluate the 
coefficients in an orthogonal polynomial expansion and (2) the frequency 
with which some functions appear in integrands, particularly when dealing 
with integrals over infinite intervals (see Sec. 4.7). 

The advantages of the formulation (4.5-1) are also twofold: (1) computationally
it will generally be easier to evaluate $f(a_j)$ than $w(a_j) f(a_j)$; and (2) it
is often convenient to express the error term in terms of a derivative of f(x)
only, especially when the weight function or one of its derivatives is unbounded
in the interval.

It is, of course, possible to treat any numerical quadrature problem in 
the form (4.5-1). That is, we can always consider splitting the integrand up 
into the product of two functions. But since, as we shall see below, the 
abscissas and weights are functions of w(x), this would necessitate the evalu- 
ation of the weights and abscissas for each problem. Thus we shall consider 
only those weight functions which are practically or mathematically 
significant. Furthermore, as we shall also see below, it is important that w(x) 
be of constant sign in [a, b].

The evaluation of the weights and abscissas in the more general Gaus- 
sian quadrature formula (4.5-1) is again quite straightforward if we make use 
of the Hermite interpolation formula. This time we merely multiply (3.7-17)
by w(x) before integrating and obtain in this way

$$ \int_a^b w(x) f(x)\,dx = \sum_{j=1}^{n} H_j f(a_j) + \sum_{j=1}^{n} \bar{H}_j f'(a_j) + E \tag{4.5-2} $$

where now

$$ H_j = \int_a^b w(x) h_j(x)\,dx \qquad \bar{H}_j = \int_a^b w(x) \bar{h}_j(x)\,dx \tag{4.5-3} $$

Now, proceeding as in the previous section, we set $\bar{H}_j = 0$, j = 1, ..., n, and
get as a necessary and sufficient condition on the abscissas that they be the
zeros of a polynomial orthogonal with respect to w(x) to all polynomials of
lesser degree over [a, b] {11}. That is,

$$ \int_a^b w(x) u_{n-1}(x) p_n(x)\,dx = 0 \tag{4.5-4} $$

where $u_{n-1}(x)$ is any polynomial of degree n − 1 or less. Then, corresponding
to (4.4-9) and (4.4-10), we have

$$ H_j = \int_a^b w(x) l_j^2(x)\,dx = \int_a^b w(x) l_j(x)\,dx \tag{4.5-5} $$

The error term becomes

$$ E = \frac{1}{(2n)!} \int_a^b w(x) p_n^2(x) f^{(2n)}(\xi)\,dx \tag{4.5-6} $$

which, if w(x) does not change sign in (a, b), can be written

$$ E = \frac{f^{(2n)}(\eta)}{(2n)!} \int_a^b w(x) p_n^2(x)\,dx \tag{4.5-7} $$

where η lies in (a, b).

If, in fact, w(x) > 0 in [a, b], then from (4.5-5) it follows that all the 
weights are positive, as we found for the Gauss-Legendre quadrature 
formula. Assuming w(x) to satisfy this condition, we can now prove the 
following theorem. 


Theorem 4.1 The abscissas defined in (4.5-4) are all real and distinct and 
lie within the interval (a, b). 


The importance of having the abscissas real is clear. The importance of 
their being within the interval (a, b) is also clear if one considers the error 
term (4.5-7), whose magnitude depends upon the magnitude of $p_n(x)$.
Furthermore, the integrand may not be defined outside the interval of 
integration. 


PROOF Let $a_1, \ldots, a_m$ be the points of (a, b) where $p_n(x)$ changes sign.
Then

$$ (x - a_1) \cdots (x - a_m) p_n(x) $$

does not change sign in [a, b]. Since $p_n(x)$ is orthogonal to all polynomials
of degree less than n with respect to w(x) over [a, b], we have

$$ \int_a^b w(x)(x - a_1) \cdots (x - a_m) p_n(x)\,dx = 0 \tag{4.5-8} $$

unless m = n. But the integrand does not change sign. Therefore, the
integral cannot be zero, and so m = n, which proves the zeros are real,
distinct, and lie within (a, b).


Equation (4.5-1) is the general form of a Gaussian quadrature formula 
with the weights defined by (4.5-5), the abscissas by (4.5-4), and the error by 
(4.5-7). In the next section we consider in more detail the properties of the 
orthogonal polynomials defined by (4.5-4). 
4.6 ORTHOGONAL POLYNOMIALS AND GAUSSIAN
QUADRATURE


Let $\{\phi_n(x)\}$ be a sequence of polynomials, the degree of $\phi_n(x)$ being n, orthogonal
with respect to a weight function w(x) over an interval [a, b]. Let the
coefficient of $x^n$ in $\phi_n(x)$ be $A_n$, so that $\phi_n(x) = A_n p_n(x)$, where $p_n(x)$ is the
orthogonal polynomial of the previous section. We introduce the coefficient
$A_n$ so that the orthogonal polynomials to be considered in this and later
sections can be put in their standard form, which generally has leading
coefficients not equal to 1. Note that the choice of the leading coefficient has
no effect whatever on the abscissas or weights. The details of the derivation
of such sequences of orthogonal polynomials are considered in the problems
{13 to 16}. We have

$$ \int_a^b w(x) \phi_i(x) \phi_j(x)\,dx = 0 \qquad i \ne j \tag{4.6-1} $$

Let

$$ \alpha_n = \frac{A_{n+1}}{A_n} \qquad \gamma_n = \int_a^b w(x) \phi_n^2(x)\,dx \tag{4.6-2} $$


The basis of the results of this section is the Christoffel-Darboux identity 

$$\sum_{k=0}^{n} \frac{\phi_k(x)\phi_k(y)}{\gamma_k} = \frac{\phi_{n+1}(x)\phi_n(y) - \phi_n(x)\phi_{n+1}(y)}{\alpha_n \gamma_n (x - y)} \tag{4.6-3}$$

the derivation of which is left to a problem {17}. 
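
As a numerical sanity check of (4.6-3), the following Python sketch evaluates both sides for the Legendre polynomials. It assumes the Legendre values $\gamma_k = 2/(2k+1)$ and $A_{n+1}/A_n = (2n+1)/(n+1)$, which follow from (4.6-10) later in this section; the function names are illustrative and do not come from the text.

```python
def legendre(n, x):
    """Return P_n(x) via the recurrence (k+1) P_{k+1} = (2k+1) x P_k - k P_{k-1}."""
    p_prev, p = 1.0, x
    if n == 0:
        return p_prev
    for k in range(1, n):
        p_prev, p = p, ((2 * k + 1) * x * p - k * p_prev) / (k + 1)
    return p


def cd_sides(n, x, y):
    """Return (left side, right side) of identity (4.6-3) for Legendre polynomials."""
    def gamma(k):
        return 2.0 / (2 * k + 1)              # gamma_k for Legendre polynomials

    alpha_n = (2 * n + 1) / (n + 1)           # A_{n+1} / A_n for Legendre polynomials
    lhs = sum(legendre(k, x) * legendre(k, y) / gamma(k) for k in range(n + 1))
    rhs = (legendre(n + 1, x) * legendre(n, y)
           - legendre(n, x) * legendre(n + 1, y)) / (alpha_n * gamma(n) * (x - y))
    return lhs, rhs


lhs, rhs = cd_sides(4, 0.3, -0.7)   # the two values should agree to roundoff
```

The check requires x and y to differ, since the right side of (4.6-3) divides by x - y.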
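If we set $y = a_j$ in (4.6-3), where $a_j$ is a zero of $\phi_n(x)$, we get 

$$\sum_{k=0}^{n-1} \frac{\phi_k(x)\phi_k(a_j)}{\gamma_k} = \frac{\phi_{n+1}(x)\phi_n(a_j) - \phi_n(x)\phi_{n+1}(a_j)}{\alpha_n \gamma_n (x - a_j)} = -\frac{\phi_n(x)\phi_{n+1}(a_j)}{\alpha_n \gamma_n (x - a_j)} \tag{4.6-4}$$

where the k = n term of the sum and the first term of the numerator vanish 
because $\phi_n(a_j) = 0$. 
Now, multiplying both sides of (4.6-4) by $w(x)\phi_0(x)$, integrating over [a, b], 
and using (4.6-1), we get 

$$\frac{\phi_0(a_j)\gamma_0}{\gamma_0} = -\frac{\phi_{n+1}(a_j)}{\alpha_n \gamma_n} \int_a^b \frac{w(x)\phi_0(x)\phi_n(x)}{x - a_j}\,dx \tag{4.6-5}$$

From the definition of the Lagrangian interpolation polynomial we have 

$$l_j(x) = \frac{p_n(x)}{(x - a_j)\,p_n'(a_j)} = \frac{\phi_n(x)}{(x - a_j)\,\phi_n'(a_j)} \tag{4.6-6}$$

Using this, (4.5-5), and the fact that $\phi_0(x)$ is a constant, we can rewrite (4.6-5) 
as 

$$1 = -\frac{\phi_{n+1}(a_j)}{\alpha_n \gamma_n} \int_a^b \frac{w(x)\phi_n(x)}{x - a_j}\,dx = -\frac{\phi_{n+1}(a_j)\phi_n'(a_j)}{\alpha_n \gamma_n} \int_a^b w(x)\,l_j(x)\,dx = -\frac{\phi_{n+1}(a_j)\phi_n'(a_j)}{\alpha_n \gamma_n}\,H_j \tag{4.6-7}$$

Thus 

$$H_j = -\frac{A_{n+1}\gamma_n}{A_n\,\phi_{n+1}(a_j)\,\phi_n'(a_j)} \qquad j = 1, \ldots, n \tag{4.6-8}$$

which, given the orthogonal polynomials, is a much simpler way to calculate 
$H_j$ than (4.5-5). The definition of $\gamma_n$ allows an obvious simplification of 
(4.5-7) to give 

$$E = \frac{\gamma_n}{A_n^2\,(2n)!}\,f^{(2n)}(\eta) \tag{4.6-9}$$

These results can be used to simplify the formulas of Sec. 4.4 on Gauss- 
Legendre quadrature. For the Legendre polynomials {19} 

$$\gamma_n = \int_{-1}^{1} P_n^2(x)\,dx = \frac{2}{2n+1} \qquad A_n = \frac{(2n)!}{2^n (n!)^2} \tag{4.6-10}$$

so that $H_j$ in terms of Legendre polynomials is 

$$H_j = \frac{-2}{(n+1)\,P_{n+1}(a_j)\,P_n'(a_j)} \tag{4.6-11}$$

Some other similar forms of $H_j$ are considered in {22}. For the error term we 
get in a corresponding manner 

$$E = \frac{2^{2n+1}(n!)^4}{(2n+1)[(2n)!]^3}\,f^{(2n)}(\eta) \tag{4.6-12}$$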


The results of this section and the next two are summarized in Table 4.4 
on p. 114. In the next two sections, we shall use the results of this and the 
previous section to derive Gaussian quadrature formulas for particular 
weight functions. 
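
To make (4.6-11) concrete, the sketch below computes Gauss-Legendre abscissas by Newton's method and then evaluates the weights directly from (4.6-11). The starting guess for each zero and the identity $(1 - x^2)P_n'(x) = n[P_{n-1}(x) - xP_n(x)]$ are standard results not stated in this section, and the function names are illustrative assumptions.

```python
import math


def legendre_pair(n, x):
    """Return (P_n(x), P_{n-1}(x)) for n >= 1 by the three-term recurrence."""
    p_prev, p = 1.0, x
    for k in range(1, n):
        p_prev, p = p, ((2 * k + 1) * x * p - k * p_prev) / (k + 1)
    return p, p_prev


def legendre_deriv(n, x):
    """P_n'(x) from (1 - x^2) P_n' = n (P_{n-1} - x P_n); valid for |x| < 1."""
    p, p_prev = legendre_pair(n, x)
    return n * (p_prev - x * p) / (1.0 - x * x)


def gauss_legendre(n):
    """Abscissas (zeros of P_n) and weights H_j computed from (4.6-11), n >= 1."""
    nodes, weights = [], []
    for j in range(1, n + 1):
        x = math.cos(math.pi * (j - 0.25) / (n + 0.5))   # common starting guess
        for _ in range(50):                               # Newton iteration on P_n
            p, _unused = legendre_pair(n, x)
            dx = p / legendre_deriv(n, x)
            x -= dx
            if abs(dx) < 1e-15:
                break
        p_next, _unused = legendre_pair(n + 1, x)
        nodes.append(x)
        weights.append(-2.0 / ((n + 1) * p_next * legendre_deriv(n, x)))
    return nodes, weights


nodes, weights = gauss_legendre(3)
```

For n = 3 this should reproduce the familiar abscissas $0, \pm\sqrt{3/5}$ with weights 8/9 and 5/9.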


4.7 GAUSSIAN QUADRATURE OVER INFINITE 
INTERVALS 


For integrals over infinite intervals (which we shall assume are convergent), 
there are many possible approaches. We give two here: (1) use a knowledge 
of the integrand to bound the magnitude of the integral from some finite 
value to infinity by a positive constant $\epsilon > 0$ and then use a quadrature 
formula for the remaining finite interval; (2) use a quadrature formula 
especially developed for the infinite interval. The first of these two 
approaches requires no further discussion. The quadrature over the finite 
interval can be performed using Gauss-Legendre quadrature or one of the 
many methods to be presented later in this chapter. In this section we treat 
the latter case. 

For numerical integration over infinite and semi-infinite intervals, it is 
convenient to use a weight function w(x) which assures the convergence of 
the integral of w(x) f(x) when f(x) is a polynomial of arbitrary degree. For 
the semi-infinite interval $(a, \infty)$ (for convenience we set a = 0), such a weight 
function is $w(x) = e^{-x}$. Therefore, the sequence of polynomials we require 
must be orthogonal over $(0, \infty)$ with respect to $e^{-x}$. Such a sequence of 
polynomials is the Laguerre polynomials [see Jackson (1941) and {20}]. The 
polynomial of degree n, $L_n(x)$, has leading coefficient $A_n = (-1)^n$. For the 
Laguerre polynomials 

$$\gamma_n = \int_0^\infty e^{-x} L_n^2(x)\,dx = (n!)^2 \tag{4.7-1}$$

Then from (4.6-8) and (4.6-9) 

$$H_j = \frac{(n!)^2}{L_{n+1}(a_j)\,L_n'(a_j)} \tag{4.7-2}$$

$$E = \frac{(n!)^2}{(2n)!}\,f^{(2n)}(\eta) \tag{4.7-3}$$

The Gauss-Laguerre quadrature formula then has the form 

$$\int_0^\infty e^{-x} f(x)\,dx = \sum_{j=1}^{n} H_j f(a_j) + E \tag{4.7-4}$$

where the $a_j$'s are the zeros of $L_n(x)$ and the $H_j$'s are given by (4.7-2) and E by 
(4.7-3). Table 4.2 is a short listing of weights and abscissas for (4.7-4). 

Table 4.2 Zeros of Laguerre polynomials and corresponding weights 

n    Abscissas $a_j$    Weights $H_j$ 
2     0.585786        0.853553 
      3.414214        0.146447 
3     0.415775        0.711093 
      2.294280        0.278518 
      6.289945        0.010389 
4     0.322548        0.603154 
      1.745761        0.357419 
      4.536620        0.038888 
      9.395071        0.000539 
5     0.263560        0.521756 
      1.413403        0.398667 
      3.596426        0.075942 
      7.085810        0.003612 
     12.640801        0.000023 


Example 4.4 Evaluate 

$$\int_0^\infty x^7 e^{-x}\,dx$$

using n = 3 in (4.7-4). 
Using Table 4.2, we have 

$$\int_0^\infty x^7 e^{-x}\,dx \approx (.711093)(.415775)^7 + (.278518)(2.294280)^7 + (.010389)(6.289945)^7 = 4139.9$$

whereas the true value of the integral is 7! = 5040. The error given by (4.7-3) is 

$$E = \frac{(3!)^2}{6!}\,7!\,\eta = 252\eta$$

with $\eta$ in $(0, \infty)$ so it cannot be bounded. Thus the substantial error between the approxi- 
mate and true values is not surprising. This illustrates the fact that the Gauss-Laguerre 
quadrature formula and the Gauss-Hermite quadrature formula to be discussed below 
should be avoided if the derivative in the error term cannot be bounded on $(0, \infty)$. In 
this case the first technique mentioned in this section will usually be preferable. Note 
that in this particular example, n = 4 would have led to an exact result (except for 
roundoff) since the eighth derivative of f(x) is zero. 
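
The arithmetic of Example 4.4 can be reproduced with a few lines of Python. This is only a sketch: the dictionary `LAGUERRE` and the function name are illustrative, and the abscissas and weights are the six-decimal values of Table 4.2, so the n = 4 result is exact only up to the rounding of the tabulated weights.

```python
# Abscissas and weights copied from Table 4.2 (six decimal places).
LAGUERRE = {
    3: [(0.415775, 0.711093), (2.294280, 0.278518), (6.289945, 0.010389)],
    4: [(0.322548, 0.603154), (1.745761, 0.357419),
        (4.536620, 0.038888), (9.395071, 0.000539)],
}


def gauss_laguerre(f, n):
    """Approximate the integral of exp(-x) f(x) over (0, inf) by the sum in (4.7-4)."""
    return sum(h * f(a) for a, h in LAGUERRE[n])


def seventh_power(x):
    return x ** 7


q3 = gauss_laguerre(seventh_power, 3)   # the example reports 4139.9
q4 = gauss_laguerre(seventh_power, 4)   # near 7! = 5040, limited by table rounding
```

The smallest weight for n = 4 is given to only three significant figures, which is why `q4` differs from 5040 by a small amount even though the formula itself is exact for this integrand.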


A generalization of the weight function e~* to x’e~** is considered in 
{20}. 

For the infinite interval $(-\infty, \infty)$ we choose, for the same reason as 
above, the weight function $e^{-x^2}$. The sequence of polynomials we require 
must be orthogonal over $(-\infty, \infty)$ with respect to this weight function. This 
sequence is the Hermite polynomials [see Jackson (1941) and {21}]. The 
polynomial $H_n(x)$ of degree n has leading coefficient $A_n = 2^n$. 

For these polynomials 

$$\gamma_n = \int_{-\infty}^{\infty} e^{-x^2} H_n^2(x)\,dx = \sqrt{\pi}\,2^n n! \tag{4.7-5}$$

Then from (4.6-8) and (4.6-9) 

$$H_j = -\frac{2^{n+1}\,n!\,\sqrt{\pi}}{H_{n+1}(a_j)\,H_n'(a_j)} \tag{4.7-6}$$

$$E = \frac{n!\,\sqrt{\pi}}{2^n (2n)!}\,f^{(2n)}(\eta) \tag{4.7-7}$$

The Gauss-Hermite quadrature formula then has the form 

$$\int_{-\infty}^{\infty} e^{-x^2} f(x)\,dx = \sum_{j=1}^{n} H_j f(a_j) + E \tag{4.7-8}$$

where the $a_j$'s are the zeros of $H_n(x)$, the $H_j$'s are given by (4.7-6), and E by 
(4.7-7). Table 4.3 is a short listing of weights and abscissas for (4.7-8). 

Table 4.3 Zeros of Hermite polynomials and corresponding weights 

n    Abscissas $a_j$    Weights $H_j$ 
2    ±0.707107        0.886227 
3     0               1.181636 
     ±1.224745        0.295409 
4    ±0.524648        0.804914 
     ±1.650680        0.081313 
5     0               0.945309 
     ±0.958572        0.393619 
     ±2.020183        0.019953 
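
A short Python sketch shows how Table 4.3 is used in (4.7-8). Only the nonnegative abscissas are stored, since the rule is symmetric; the names `HERMITE` and `gauss_hermite` are illustrative assumptions, and accuracy is limited to the six decimals of the table.

```python
import math

# Table 4.3: nonnegative abscissas with their weights; each positive abscissa
# also stands for its mirror image -a_j, which carries the same weight.
HERMITE = {
    2: [(0.707107, 0.886227)],
    3: [(0.0, 1.181636), (1.224745, 0.295409)],
    4: [(0.524648, 0.804914), (1.650680, 0.081313)],
    5: [(0.0, 0.945309), (0.958572, 0.393619), (2.020183, 0.019953)],
}


def gauss_hermite(f, n):
    """Approximate the integral of exp(-x^2) f(x) over the real line by (4.7-8)."""
    total = 0.0
    for a, h in HERMITE[n]:
        if a == 0.0:
            total += h * f(a)
        else:
            total += h * (f(a) + f(-a))
    return total


second_moment = gauss_hermite(lambda x: x * x, 2)    # exact value is sqrt(pi)/2
fourth_moment = gauss_hermite(lambda x: x ** 4, 3)   # exact value is 3 sqrt(pi)/4
```

Both integrands are polynomials of degree at most 2n - 1, so the rule is exact for them apart from the rounding in the table.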


4.8 PARTICULAR GAUSSIAN QUADRATURE 
FORMULAS 


In this section we consider Gaussian quadrature over finite intervals using 
weight functions of considerable theoretical and practical importance. 


4.8-1 Gauss-Jacobi Quadrature 


We consider here the weight function $w(x) = (1 - x)^{\alpha}(1 + x)^{\beta}$, $\alpha, \beta > -1$. 
The polynomials orthogonal with respect to this weight function over [-1, 1] are the 
Jacobi polynomials $J_n(x; \alpha, \beta)$. They are generally defined so that the 
coefficient $A_n$ of $x^n$ in the polynomial of degree n is given by 

$$A_n = \frac{1}{2^n n!}\,\frac{\Gamma(2n + \alpha + \beta + 1)}{\Gamma(n + \alpha + \beta + 1)} \tag{4.8-1}$$

and we can calculate {24} 

$$\gamma_n = \int_{-1}^{1} (1 - x)^{\alpha}(1 + x)^{\beta} J_n^2(x; \alpha, \beta)\,dx = \frac{2^{\alpha+\beta+1}}{n!\,(2n + \alpha + \beta + 1)}\,\frac{\Gamma(n + \alpha + 1)\,\Gamma(n + \beta + 1)}{\Gamma(n + \alpha + \beta + 1)} \tag{4.8-2}$$

Then proceeding as in the previous section, we get the Gauss-Jacobi quadra- 
ture formula (sometimes called the Mehler quadrature formula) 

$$\int_{-1}^{1} (1 - x)^{\alpha}(1 + x)^{\beta} f(x)\,dx = \sum_{j=1}^{n} H_j f(a_j) + E \tag{4.8-3}$$

where {24} 

$$H_j = -\frac{2n + \alpha + \beta + 2}{n + \alpha + \beta + 1}\,\frac{\Gamma(n + \alpha + 1)\,\Gamma(n + \beta + 1)}{\Gamma(n + \alpha + \beta + 1)\,(n + 1)!}\,\frac{2^{\alpha+\beta}}{J_n'(a_j; \alpha, \beta)\,J_{n+1}(a_j; \alpha, \beta)} \tag{4.8-4}$$

and 

$$E = \frac{\Gamma(n + \alpha + 1)\,\Gamma(n + \beta + 1)\,\Gamma(n + \alpha + \beta + 1)}{(2n + \alpha + \beta + 1)[\Gamma(2n + \alpha + \beta + 1)]^2}\,\frac{2^{2n+\alpha+\beta+1}\,n!}{(2n)!}\,f^{(2n)}(\eta) \tag{4.8-5}$$

The Gauss-Legendre quadrature formula of Sec. 4.4 is just a special case 
of the Gauss-Jacobi formula with $\alpha = \beta = 0$. In the next section, we consider 
another special case $\alpha = \beta = -\frac{1}{2}$ because of the importance of the weight 
function in this case and because it serves to introduce the Chebyshev poly- 
nomials, which will play an important role in Chap. 7. 


4.8-2 Gauss-Chebyshev Quadrature 

When $\alpha = \beta = -\frac{1}{2}$ so that $w(x) = 1/(1 - x^2)^{1/2}$, (4.8-1) yields 

$$A_n = \frac{1}{2^n n!}\,\frac{\Gamma(2n)}{\Gamma(n)} \tag{4.8-6}$$

but in this case it is customary to choose 

$$A_0 = 1 \qquad A_n = 2^{n-1} \quad n \ge 1 \tag{4.8-7}$$

Using (4.8-7), we can calculate {26} 

$$\gamma_n = \frac{\pi}{2} \qquad n \ge 1 \tag{4.8-8}$$

The orthogonal polynomials $J_n(x; -\frac{1}{2}, -\frac{1}{2})$ are generally denoted by $T_n(x)$ 
and are called Chebyshev polynomials of the first kind or usually just 
Chebyshev polynomials. The Gauss-Chebyshev quadrature formula has the 
form 

$$\int_{-1}^{1} \frac{f(x)}{(1 - x^2)^{1/2}}\,dx = \sum_{j=1}^{n} H_j f(a_j) + E \tag{4.8-9}$$

where the $a_j$'s are the zeros of $T_n(x)$, 

$$H_j = -\frac{\pi}{T_{n+1}(a_j)\,T_n'(a_j)} \tag{4.8-10}$$

and 

$$E = \frac{2\pi}{2^{2n}(2n)!}\,f^{(2n)}(\eta) \tag{4.8-11}$$

Equation (4.8-10) can be remarkably simplified by using the result {26} that 

$$T_n(x) = \cos(n \cos^{-1} x) \tag{4.8-12}$$

Therefore, at a zero of $T_n(x)$ we have $\cos(n \cos^{-1} x) = 0$ and 
$\sin(n \cos^{-1} x) = \pm 1$. Thus 

$$T_{n+1}(a_j) = \cos[(n + 1)\cos^{-1} a_j] = \cos(n \cos^{-1} a_j)\,a_j - \sin(n \cos^{-1} a_j)\,\sin(\cos^{-1} a_j) = \mp(1 - a_j^2)^{1/2} \tag{4.8-13}$$

since $a_j$ is a zero of $T_n(x)$. Further 

$$T_n'(a_j) = \sin(n \cos^{-1} a_j)\,\frac{n}{(1 - a_j^2)^{1/2}} = \frac{\pm n}{(1 - a_j^2)^{1/2}} \tag{4.8-14}$$

Using (4.8-13) and (4.8-14) in (4.8-10), it follows that 

$$H_j = \frac{\pi}{n} \tag{4.8-15}$$

From (4.8-12) it also follows that 

$$a_j = \cos\frac{(2j - 1)\pi}{2n} \qquad j = 1, \ldots, n \tag{4.8-16}$$

Thus all the weights are equal, and (4.8-9) can be written 

$$\int_{-1}^{1} \frac{f(x)}{(1 - x^2)^{1/2}}\,dx = \frac{\pi}{n} \sum_{j=1}^{n} f(a_j) + E \tag{4.8-17}$$


Example 4.5 Evaluate 

$$\int_{-1}^{1} \frac{e^x}{(1 - x^2)^{1/2}}\,dx$$

correct to six decimal places. Applying (4.8-11), we find that the error $E_n$ in using the 
n-point rule (4.8-17) is given by 

$$E_n = \frac{2\pi}{2^{2n}(2n)!}\,e^{\eta}$$

so that 

$$|E_n| < \frac{2\pi e}{2^{2n}(2n)!} = B_n$$

For n = 4, $B_4 = 1.66 \times 10^{-6}$, and for n = 5, $B_5 = 4.6 \times 10^{-9}$. Hence, we apply (4.8-17) 
with n = 5 to get 

$$\frac{\pi}{5} \sum_{j=1}^{5} \exp\left[\cos\frac{(2j - 1)\pi}{10}\right] = 3.977463$$

correct to six decimal places. 
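
The computation in Example 4.5 is easy to reproduce. The sketch below implements (4.8-16) and (4.8-17) together with the bound $B_n$ used above; the function names are illustrative, and the integrand $e^x$ is the one implied by the error term and the final sum of the example.

```python
import math


def gauss_chebyshev(f, n):
    """(pi/n) * sum of f(a_j) with a_j = cos((2j-1) pi / (2n)), as in (4.8-17)."""
    return math.pi / n * sum(
        f(math.cos((2 * j - 1) * math.pi / (2 * n))) for j in range(1, n + 1))


def error_bound_exp(n):
    """B_n = 2 pi e / (2^(2n) (2n)!), from (4.8-11) with |f^(2n)| <= e on [-1, 1]."""
    return 2.0 * math.pi * math.e / (2 ** (2 * n) * math.factorial(2 * n))


approx = gauss_chebyshev(math.exp, 5)   # the example reports 3.977463
```

Comparing `error_bound_exp(4)` and `error_bound_exp(5)` with the target accuracy of half a unit in the sixth decimal place shows why n = 5 is the first value of n whose bound guarantees the result.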
4.8-3 Singular Integrals 


A common problem in numerical analysis is the need to evaluate integrals in 
which the integrand has a singularity. If the singularity results in an im- 
proper integral, as in the case of $1/(1 - x)^{1/2}$ on [0, 1], we assume the integral 
is convergent. But we must also consider singularities of the form $(1 - x)^{1/2}$ 
on [0, 1]. In both cases, a Gaussian quadrature approach without weight 
functions would lead to trouble because of the derivative of the integrand 
which appears in the error. 

The Gauss-Chebyshev quadrature formula is a good example of the 
value of the weight-function approach to singular integrals. For if f(x) in 
(4.8-9) is analytic, the effect of making the singular term $1/(1 - x^2)^{1/2}$ the 
weight function is to remove this term from the summation and the error 
term on the right-hand side of (4.8-9). In this section we shall consider some 
other applications of this technique to common types of singularities. We 
shall restrict our discussion to integrands with singularities at the endpoints 
of the interval. However, by splitting the integral into two integrals, singular- 
ities in the interior of the interval can also be handled. 

The general problem then is to find quadrature formulas of the form 

$$\int_a^b s(x) f(x)\,dx = \sum_{j=1}^{n} H_j f(a_j) + E \tag{4.8-18}$$

where the weight function s(x) is singular at one endpoint or both. Without 
loss of generality, we shall restrict ourselves to [a, b] = [-1, 1] or [0, 1], 
whichever is most convenient. 


Case 1: $s(x) = (1 - x^2)^{1/2}$ on [-1, 1]. This is the Jacobi weight function 
with $\alpha = \beta = \frac{1}{2}$. The resulting orthogonal polynomials are called Chebyshev 
polynomials of the second kind. Let the polynomial of degree n be $S_n(x)$ 
[sometimes denoted in the literature by $U_n(x)$]. Then, analogous to (4.8-12), 
we have the relation {30} 

$$S_n(x) = \frac{\sin[(n + 1)\cos^{-1} x]}{\sin(\cos^{-1} x)} \tag{4.8-19}$$

The abscissas in (4.8-18) are therefore {30} 

$$a_j = \cos\frac{j\pi}{n + 1} \qquad j = 1, \ldots, n \tag{4.8-20}$$

and, using (4.6-8), we can calculate {30} 

$$H_j = \frac{\pi}{n + 1}\,\sin^2\frac{j\pi}{n + 1} \tag{4.8-21}$$

Using (4.6-9), we find {30} 

$$E = \frac{\pi}{2^{2n+1}(2n)!}\,f^{(2n)}(\eta) \tag{4.8-22}$$
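
Because both the abscissas (4.8-20) and the weights (4.8-21) are available in closed form, Case 1 needs no tables. The following sketch (the function name is an illustrative assumption) applies the rule to two moments whose exact values, $\pi/8$ and $\pi/16$, are known.

```python
import math


def gauss_chebyshev_second_kind(f, n):
    """Rule for the integral of sqrt(1 - x^2) f(x) over [-1, 1], from (4.8-20), (4.8-21)."""
    total = 0.0
    for j in range(1, n + 1):
        theta = j * math.pi / (n + 1)
        weight = math.pi / (n + 1) * math.sin(theta) ** 2
        total += weight * f(math.cos(theta))
    return total


m2 = gauss_chebyshev_second_kind(lambda x: x * x, 2)    # exact value pi/8
m4 = gauss_chebyshev_second_kind(lambda x: x ** 4, 3)   # exact value pi/16
```

Both test integrands have degree at most 2n - 1, so the error term (4.8-22) vanishes and only roundoff remains.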
Case 2: $s(x) = 1/\sqrt{x}$ on [0, 1]. Here we have a singularity at one endpoint. 
By manipulation of the orthogonality integral (4.6-1), we can show that the 
orthogonal polynomial of degree n, $p_n(x)$, is given by {31} 

$$p_n(x) = P_{2n}(\sqrt{x}) \tag{4.8-23}$$

where $P_{2n}(x)$ is the Legendre polynomial of degree 2n. Corresponding to 
each positive zero $\alpha_j$ of $P_{2n}(x)$, there is then an abscissa of (4.8-18) given by 

$$a_j = \alpha_j^2 \tag{4.8-24}$$

Using (4.6-8) and (4.8-23), we can also show that {31} 

$$H_j = 2h_j \tag{4.8-25}$$

where $h_j$ is the weight corresponding to $\alpha_j$ in the Gauss-Legendre formula of 
order 2n. For the error we get {31} 

$$E = \frac{2^{4n+1}[(2n)!]^3}{(4n + 1)[(4n)!]^2}\,f^{(2n)}(\eta) \tag{4.8-26}$$


Example 4.6 Evaluate 

$$\int_0^1 \frac{1 + x}{\sqrt{x}}\,dx$$

using n = 2 in (4.8-18). From Table 4.1 and (4.8-24), we have 

$$a_1 = (.339981)^2 = .115587 \qquad a_2 = (.861136)^2 = .741555$$

and 

$$H_1 = 1.304290 \qquad H_2 = .695710$$

Therefore, our approximation to the integral is 

$$\int_0^1 \frac{1 + x}{\sqrt{x}}\,dx \approx (1.304290)(1.115587) + (.695710)(1.741555) = 2.666666$$

whereas the true value is 8/3. Since f(x) = 1 + x, the derivative in (4.8-26) is zero and so, 
except for roundoff, the result is exact, as it should be. 
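
Example 4.6 can be checked with the sketch below, which builds the two-point rule for the weight $1/\sqrt{x}$ from the positive zeros and weights of the four-point Gauss-Legendre rule. The numerical values are those quoted in the example (with $h_j = H_j/2$); the variable and function names are illustrative assumptions.

```python
# Positive zeros alpha_j and weights h_j of the four-point Gauss-Legendre rule,
# to the six decimals used in Example 4.6.
ALPHA = [0.339981, 0.861136]
H_LEGENDRE = [0.652145, 0.347855]


def inverse_sqrt_rule(f):
    """Two-point rule for the integral of f(x)/sqrt(x) over [0, 1].

    Uses a_j = alpha_j^2 from (4.8-24) and H_j = 2 h_j from (4.8-25).
    """
    return sum(2.0 * h * f(alpha * alpha) for alpha, h in zip(ALPHA, H_LEGENDRE))


approx = inverse_sqrt_rule(lambda x: 1.0 + x)   # the example reports 2.666666
```

With n = 2 the rule is exact for f of degree up to 3, so any discrepancy from 8/3 comes from the six-decimal input data.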


Case 3: $s(x) = \sqrt{x}$ on [0, 1]. Again the singularity is at one endpoint, but 
this time the singularity is in the derivatives of the function. In a manner 
similar to case 2, we can show that {31} 

$$p_n(x) = \frac{1}{\sqrt{x}}\,P_{2n+1}(\sqrt{x}) \tag{4.8-27}$$

Therefore, if $\alpha_j$ is a positive zero of $P_{2n+1}(x)$, the abscissa in (4.8-18) is given 
by 

$$a_j = \alpha_j^2 \tag{4.8-28}$$

Again using (4.6-8), we find that the corresponding weight is given by {31} 

$$H_j = 2h_j\alpha_j^2 \tag{4.8-29}$$

where $h_j$ is the weight corresponding to $\alpha_j$ in the Gauss-Legendre formula of 
order 2n + 1. Finally, for the error we get {31} 

$$E = \frac{2^{4n+3}[(2n + 1)!]^4}{(4n + 3)[(4n + 2)!]^2 (2n)!}\,f^{(2n)}(\eta) \tag{4.8-30}$$

Case 4: $s(x) = [x/(1 - x)]^{1/2}$ on [0, 1]. This time the weight function has a 
singularity at one endpoint, and its derivatives have a singularity at the other 
endpoint. In this case we get {32} 

$$p_n(x) = \frac{1}{\sqrt{x}}\,T_{2n+1}(\sqrt{x}) \tag{4.8-31}$$

where $T_{2n+1}(x)$ is the Chebyshev polynomial of degree 2n + 1. From this it 
follows that 

$$a_j = \cos^2\frac{(2j - 1)\pi}{4n + 2} \tag{4.8-32}$$

From (4.6-8) we get {32} 

$$H_j = \frac{2\pi}{2n + 1}\,a_j \tag{4.8-33}$$

and from (4.6-9) 

$$E = \frac{\pi}{2^{4n+1}(2n)!}\,f^{(2n)}(\eta) \tag{4.8-34}$$
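
As with Case 1, the Case 4 rule is fully explicit. The sketch below (with an illustrative function name) evaluates (4.8-32) and (4.8-33) and can be checked against the exact values $\pi/2$ for f(x) = 1 and $35\pi/128$ for $f(x) = x^3$, which follow from the beta-function integral.

```python
import math


def sqrt_ratio_rule(f, n):
    """Rule for the integral of sqrt(x / (1 - x)) f(x) over [0, 1].

    Abscissas from (4.8-32) and weights from (4.8-33).
    """
    total = 0.0
    for j in range(1, n + 1):
        a = math.cos((2 * j - 1) * math.pi / (4 * n + 2)) ** 2
        total += 2.0 * math.pi / (2 * n + 1) * a * f(a)
    return total


zeroth = sqrt_ratio_rule(lambda x: 1.0, 2)     # exact value pi/2
cubic = sqrt_ratio_rule(lambda x: x ** 3, 2)   # exact value 35 pi / 128
```

With n = 2 the error term (4.8-34) involves the fourth derivative, so the cubic is integrated exactly up to roundoff.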


The results of Secs. 4.6 to 4.8 are summarized in Table 4.4. 


4.9 COMPOSITE QUADRATURE FORMULAS 


We must face the same problem in choosing n, the number of abscissas in a 
quadrature formula, that we faced in choosing the number of points in an 
interpolation formula. As n gets larger, the constant in the error term 
decreases but the order of the derivative increases. One problem in using 
large values of n is the difficulty of estimating high-order derivatives. 
Another is the fact we mentioned in Sec. 3.4 that ultimately the derivatives of 
all but certain entire functions increase without bound. This result is most 
easily derived by showing that any function f(z) of a complex variable with 
bounded derivatives of all orders at a point $z_0$ must be entire. The Taylor- 
series expansion of f(z) about $z_0$ is given by 

$$f(z) = \sum_{j=0}^{\infty} A_j (z - z_0)^j \qquad A_j = \frac{f^{(j)}(z_0)}{j!} \tag{4.9-1}$$

Table 4.4 Summary of Gaussian quadrature formulas of the form 

$$\int_a^b w(x) f(x)\,dx \approx \sum_{j=1}^{n} H_j f(a_j)$$

Weight function w(x)            Interval [a, b]        Abscissas $a_j$ are zeros of                          Weights $H_j$ given by Eq.   Error given by Eq. 
1                               [-1, 1]                $P_n(x)$                                              (4.6-11)                    (4.6-12) 
$e^{-x}$                        $(0, \infty)$          $L_n(x)$                                              (4.7-2)                     (4.7-3) 
$e^{-x^2}$                      $(-\infty, \infty)$    $H_n(x)$                                              (4.7-6)                     (4.7-7) 
$(1-x)^{\alpha}(1+x)^{\beta}$   [-1, 1]                $J_n(x; \alpha, \beta)$                               (4.8-4)                     (4.8-5) 
$1/(1-x^2)^{1/2}$               [-1, 1]                $T_n(x)$, $a_j = \cos\frac{(2j-1)\pi}{2n}$            (4.8-15)                    (4.8-11) 
$(1-x^2)^{1/2}$                 [-1, 1]                $S_n(x)$, $a_j = \cos\frac{j\pi}{n+1}$                (4.8-21)                    (4.8-22) 
$1/\sqrt{x}$                    [0, 1]                 $P_{2n}(\sqrt{x})$                                    (4.8-25)                    (4.8-26) 
$\sqrt{x}$                      [0, 1]                 $\frac{1}{\sqrt{x}}P_{2n+1}(\sqrt{x})$                (4.8-29)                    (4.8-30) 
$[x/(1-x)]^{1/2}$               [0, 1]                 $\frac{1}{\sqrt{x}}T_{2n+1}(\sqrt{x})$                (4.8-33)                    (4.8-34) 


If all the derivatives of f(z) at $z_0$ are bounded so that $|f^{(j)}(z_0)| \le M$ for 
some M and all j, then from (4.9-1) 

$$|f(z)| \le \sum_{j=0}^{\infty} |A_j|\,|z - z_0|^j \le M \sum_{j=0}^{\infty} \frac{|z - z_0|^j}{j!} = M e^{|z - z_0|} \tag{4.9-2}$$

Since this implies the Taylor series converges for all finite z, f(z) is entire. 

When f(z) is not entire, we can estimate the rate of growth of the 
derivatives by writing the coefficients in (4.9-1) in the form 

$$A_j = \frac{1}{2\pi i} \oint_C \frac{f(z)}{(z - z_0)^{j+1}}\,dz \tag{4.9-3}$$

where C is any circle centered at $z_0$ of radius less than the radius of 
convergence R of the Taylor series. If L is a bound on the magnitude of 
f(z) in the circle $|z - z_0| < R$, then it follows from (4.9-3) that 

$$|f^{(j)}(z_0)| = j!\,|A_j| \le \frac{L\,j!}{R^j}$$

Therefore, the derivatives may grow eventually as fast as $j!/R^j$. Since in fact 
rapid growth of derivatives may start for quite low derivatives (see, for 
example, Example 4.5) and because of the difficulty in estimating high deriva- 
tives, there is a tendency not to use high-order quadrature formulas. But 
low-order ones may not be sufficiently accurate. Consider, for example, the 
error term in the Gauss-Legendre quadrature formula (4.6-12). Suppose f(x) 
is such that we can estimate its derivatives only up to order 4 but that for 
n = 1 and 2 the bound on E does not assure us of the accuracy we desire. To 
avoid this predicament, we can proceed as follows: (1) break up the interval 
[a, b] into a number, say m, of subintervals; (2) on each subinterval apply a 
quadrature formula and sum the results. 

The effectiveness of this technique depends upon the fact (which may 
have been obscured by our use of the interval [-1, 1]) that the error in all 
the finite-interval quadrature formulas we have developed is proportional to 
some power of the length of the interval. To see this, consider the problem of 
evaluating 

$$\int_a^b f(x)\,dx \tag{4.9-4}$$

by Gauss-Legendre quadrature. We first make the change of variable 

$$y = \frac{1}{b - a}(2x - a - b) \qquad x = \frac{1}{2}[(b - a)y + a + b] \tag{4.9-5}$$

which changes (4.9-4) to 

$$\int_a^b f(x)\,dx = \frac{b - a}{2} \int_{-1}^{1} g(y)\,dy \tag{4.9-6}$$

with $g(y) = f[\frac{1}{2}(b - a)y + \frac{1}{2}(a + b)]$. 

When Gauss-Legendre quadrature is used, the error in $\int_{-1}^{1} g(y)\,dy$ is given 
by (4.6-12) as 

$$\bar{E} = \frac{2^{2n+1}(n!)^4}{(2n + 1)[(2n)!]^3}\,g^{(2n)}(\bar{\eta}) \qquad \bar{\eta} \in (-1, 1) \tag{4.9-7}$$

where the derivative is with respect to y. But using (4.9-5), we have 

$$g^{(2n)}(y) = \frac{d^{2n} g}{dy^{2n}} = \left(\frac{b - a}{2}\right)^{2n} f^{(2n)}(x) \tag{4.9-8}$$

Thus the error incurred in $\int_a^b f(x)\,dx$ is 

$$E = \left(\frac{b - a}{2}\right)^{2n+1} \frac{2^{2n+1}(n!)^4}{(2n + 1)[(2n)!]^3}\,f^{(2n)}(\eta) \qquad \eta \in (a, b) \tag{4.9-9}$$

If now we write 

$$\int_a^b f(x)\,dx = \int_a^{(a+b)/2} f(x)\,dx + \int_{(a+b)/2}^{b} f(x)\,dx \tag{4.9-10}$$

and apply to each of the integrals on the right-hand side of (4.9-10) a Gauss- 
Legendre formula with n points, it follows as above that the error is 

$$E = \frac{1}{2^{2n+1}} \left(\frac{b - a}{2}\right)^{2n+1} \frac{2^{2n+1}(n!)^4}{(2n + 1)[(2n)!]^3} \left[f^{(2n)}(\eta_1) + f^{(2n)}(\eta_2)\right] \qquad \eta_1 \in \left(a, \frac{a + b}{2}\right) \quad \eta_2 \in \left(\frac{a + b}{2}, b\right) \tag{4.9-11}$$

which, assuming continuity of the 2nth derivative of f(x), can be written 

$$E = \frac{1}{2^{2n}} \left(\frac{b - a}{2}\right)^{2n+1} \frac{2^{2n+1}(n!)^4}{(2n + 1)[(2n)!]^3}\,f^{(2n)}(\eta_3) \qquad \eta_3 \in (a, b) \tag{4.9-12}$$

Thus dividing the interval in two and using an n-point formula in both 
intervals has brought an extra factor of $1/2^{2n}$ into the error term, leaving 
everything else the same [except that the 2nth derivative is evaluated at 
different points in (4.9-9) and (4.9-12)]. If, instead of two intervals, we had 
divided [a, b] into m intervals, the added factor would have been $1/m^{2n}$ {36}. 
Thus the procedure outlined above is indeed an effective way of performing 
numerical quadrature more accurately by adding abscissas to the interval 
[a, b] but without increasing the order of the quadrature formula used. 

Performing the integral over [a, b] of f(x) by the method outlined here is 
familiar to all users of the trapezoidal and parabolic rules, which we shall 
consider in Sec. 4.10-1. As an example in the Gaussian context let us con- 
sider evaluating (4.9-4) by dividing [a, b] into m intervals and using a Gauss- 
Legendre quadrature formula with n = 2 over each interval. Using Table 4.1, 
the integral is approximated as 

$$\int_a^b f(x)\,dx \approx \frac{h}{2} \sum_{j=0}^{m-1} \left\{ f\left[a + \frac{h}{2}\left(2j + 1 - \frac{1}{\sqrt{3}}\right)\right] + f\left[a + \frac{h}{2}\left(2j + 1 + \frac{1}{\sqrt{3}}\right)\right] \right\} \tag{4.9-13}$$

where h = (b - a)/m. The error in (4.9-13) is given, using (4.9-12) with $2^{2n}$ 
replaced by $m^{2n}$, by 

$$E = \frac{(b - a)^5}{4320\,m^4}\,f^{iv}(\eta) = \frac{(b - a)h^4}{4320}\,f^{iv}(\eta) \tag{4.9-14}$$


A quadrature formula of the type (4.9-13), which is the sum of a number of 
quadrature formulas over subintervals, is called a composite quadrature 
formula, a composite rule, or sometimes a compound rule. From (4.9-14) we 
see that if $f^{iv}(\eta)$ is bounded in [a, b], by increasing the number of subinter- 
vals we can make the error arbitrarily small. In general, if the derivative 
in the error term is bounded on the interval of integration, then, in the 
limit as the number of subintervals $m \to \infty$, the approximation must converge 
to the true value of the integral as $1/m^{2n}$. Indeed even if this derivative 
is not bounded, any composite rule will converge to the integral of a 
Riemann-integrable function provided only that the sum of the weights of 
the underlying rule equals the length of the basic integration interval and 
that the abscissas lie in that interval. This is true since any composite rule 
can be written as follows as a weighted sum of Riemann sums. Let 

$$Q(f) = \sum_{i=1}^{n} H_i f(x_i) \qquad \sum_{i=1}^{n} H_i = 1 \tag{4.9-15}$$

be a quadrature rule approximating $I(f) = \int_0^1 f(x)\,dx$. Then the composite 
rule using m subintervals $Q_m(f)$ has the form 

$$Q_m(f) = \sum_{j=0}^{m-1} \sum_{i=1}^{n} \frac{H_i}{m}\,f\left(\frac{x_i + j}{m}\right) = \sum_{i=1}^{n} H_i \left[\frac{1}{m} \sum_{j=0}^{m-1} f\left(\frac{x_i + j}{m}\right)\right] \tag{4.9-16}$$

The right-hand formula is a weighted sum of the Riemann sums 

$$S_i = \frac{1}{m} \sum_{j=0}^{m-1} f\left(\frac{x_i + j}{m}\right) \qquad i = 1, \ldots, n \tag{4.9-17}$$

Hence, if $S_i$ converges to I(f) as $m \to \infty$ for all i, then $Q_m(f)$ will also 
converge to I(f) since 

$$\lim_{m \to \infty} Q_m(f) = \lim_{m \to \infty} \sum_{i=1}^{n} H_i S_i = \sum_{i=1}^{n} H_i \left(\lim_{m \to \infty} S_i\right) = \sum_{i=1}^{n} H_i\,I(f) = I(f) \tag{4.9-18}$$

For this reason composite quadrature formulas are used in most numerical 
quadrature work. 
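
The decomposition into Riemann sums can be illustrated numerically. The sketch below takes the two-point Gauss-Legendre rule mapped to [0, 1] as the basic rule and applies the composite rule to $f(x) = \sqrt{x}$, whose derivatives are unbounded at x = 0; the choice of test function and all names are illustrative assumptions.

```python
import math

# Two-point Gauss-Legendre rule mapped to [0, 1]; the weights sum to 1.
X = [0.5 * (1.0 - 1.0 / math.sqrt(3.0)), 0.5 * (1.0 + 1.0 / math.sqrt(3.0))]
H = [0.5, 0.5]


def riemann_sum(f, x_i, m):
    """S_i of (4.9-17): one evaluation point in each subinterval of width 1/m."""
    return sum(f((x_i + j) / m) for j in range(m)) / m


def composite(f, m):
    """Q_m(f) as the weighted sum of Riemann sums on the right of (4.9-16)."""
    return sum(h * riemann_sum(f, x, m) for x, h in zip(X, H))


# Errors for the integral of sqrt(x) over [0, 1], whose exact value is 2/3.
errors = [abs(composite(math.sqrt, m) - 2.0 / 3.0) for m in (4, 16, 64)]
```

The errors shrink as m grows even though no derivative bound is available, although more slowly than the $1/m^{2n}$ rate that applies when the derivative in the error term is bounded.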

A further advantage of composite formulas will be indicated in 
Secs. 4.10-1 and 4.10-2, where we shall develop a method to get a more 
accurate result than that given by two or more composite quadratures each 
using a different number of subintervals. 


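Example 4.7 Evaluate the integral of Example 4.3 using (4.9-13) with m = 2. 
We have h = 1; therefore, 

$$\int_1^3 \frac{dx}{x} \approx \frac{1}{2} \left[ \frac{1}{1 + \frac{1}{2}\left(1 - \frac{1}{\sqrt{3}}\right)} + \frac{1}{1 + \frac{1}{2}\left(1 + \frac{1}{\sqrt{3}}\right)} + \frac{1}{1 + \frac{1}{2}\left(3 - \frac{1}{\sqrt{3}}\right)} + \frac{1}{1 + \frac{1}{2}\left(3 + \frac{1}{\sqrt{3}}\right)} \right] = 1.097713$$

Using (4.9-14), the error is bounded by 

$$.000046 \approx \frac{1}{90}\,\frac{1}{(3)^5} < E < \frac{1}{90} \approx .011111$$

and, in fact, we have achieved a substantially more accurate result than in Example 4.3 at 
the cost of using four abscissas instead of three. 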
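
A direct implementation of (4.9-13) reproduces the value in Example 4.7. The sketch assumes that the integral of Example 4.3 is $\int_1^3 dx/x$, which is what the step h = 1 and the error bounds above imply; the function name is illustrative.

```python
import math


def composite_gauss2(f, a, b, m):
    """Composite two-point Gauss-Legendre rule (4.9-13) on m equal subintervals."""
    h = (b - a) / m
    r = 1.0 / math.sqrt(3.0)
    total = 0.0
    for j in range(m):
        total += f(a + 0.5 * h * (2 * j + 1 - r))
        total += f(a + 0.5 * h * (2 * j + 1 + r))
    return 0.5 * h * total


def reciprocal(x):
    return 1.0 / x


approx = composite_gauss2(reciprocal, 1.0, 3.0, 2)     # the example reports 1.097713
error_m2 = abs(approx - math.log(3.0))
error_m4 = abs(composite_gauss2(reciprocal, 1.0, 3.0, 4) - math.log(3.0))
```

Doubling m from 2 to 4 should cut the error by roughly the factor $2^{2n} = 16$ predicted by (4.9-12), the exact ratio depending on where the fourth derivative is evaluated.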
4.10 NEWTON-COTES QUADRATURE FORMULAS 


We noted in Sec. 4.4 the advantage of having abscissas and weights which 
are simple numbers when using a desk calculator. To the reasons for this 
adduced previously we might add that simplicity of numbers also serves to 
reduce blunders. In this section we shall develop a class of quadrature for- 
mulas that are ideal in this sense for hand-calculator work. Most readers are 
undoubtedly familiar with the most common members of this class—the 
trapezoidal formula and Simpson’s rule. It should not be thought, however, 
that this class of formulas is applicable mainly for hand rather than automa- 
tic computation. Frequently a member of this class is the best method to use 
on a digital computer. 

A quadrature formula in which the abscissas are constrained to be 
equally spaced is called a Newton-Cotes quadrature formula. Such formulas are thus, 
in particular, ideally suited for the numerical quadrature of tabulated func- 
tions. Newton-Cotes formulas of practical value fall into one of two classes: 
(1) closed formulas, where the endpoints of the interval are abscissas, and (2) 
open formulas, where the endpoints are not abscissas and the other abscissas 
are symmetrically placed with respect to the endpoints. Other types, such as 
half-open formulas, are possible but are seldom used in practice for 
quadrature. 

Closed Newton-Cotes quadrature formulas have the form 

$$\int_a^b w(x) f(x)\,dx = \sum_{j=0}^{n} H_j f(a + hj) + E \tag{4.10-1}$$

where h = (b - a)/n and w(x) is a weight function. Thus (4.10-1) involves 
n + 1 abscissas; we have summed from 0 to n (in contrast to our previous 
notation) for notational simplicity in what follows in this section. Also for 
convenience of notation let $a_j = a + hj$ so that $a = a_0$ and $b = a_n$. Then 
(4.10-1) becomes 

$$\int_{a_0}^{a_n} w(x) f(x)\,dx = \sum_{j=0}^{n} H_j f(a_j) + E \tag{4.10-2}$$

We have the n + 1 weights $H_j$ at our disposal, so that we expect to be able to 
make (4.10-2) exact for polynomials of degree n or less. In fact we shall see 
that when n is even, we get exactness for polynomials of degree n + 1 also 
{38} provided the weight function w(x) is symmetric on the interval $[a_0, a_n]$. 
To determine the weights, we use the Lagrangian interpolation formula 
with the abscissas chosen as above, multiply both sides by w(x), and inte- 
grate between $a_0$ and $a_n$: 

$$\int_{a_0}^{a_n} w(x) f(x)\,dx = \sum_{j=0}^{n} H_j f(a_j) + \frac{1}{(n + 1)!} \int_{a_0}^{a_n} w(x)\,p_{n+1}(x)\,f^{(n+1)}(\xi)\,dx \tag{4.10-3}$$

where 

$$H_j = \int_{a_0}^{a_n} w(x)\,l_j(x)\,dx \tag{4.10-4}$$

and $p_{n+1}(x) = (x - a_0) \cdots (x - a_n)$. Because of the (n + 1)st derivative in the 
error term, the $H_j$'s given by (4.10-4) make (4.10-2) exact for polynomials of 
degree n or less. These $H_j$'s are called Cotes numbers. 

The simplification of the error term in (4.10-3) is considerably more 
difficult than before because even if w(x) does not change sign in $[a_0, a_n]$, 
$p_{n+1}(x)$ does. Thus the second law of the mean cannot be applied. In order to 
simplify the error, we assume that w(x) is an even function with respect to 
the center of the interval, that is, $w[x - (a_0 + a_n)/2] = w[(a_0 + a_n)/2 - x]$, 
and consider first the case where n is even. Integrating by parts, we have 

$$E = \frac{1}{(n + 1)!} \int_{a_0}^{a_n} w(x)\,p_{n+1}(x)\,f^{(n+1)}(\xi)\,dx = -\frac{1}{(n + 1)!} \int_{a_0}^{a_n} q(x)\,\frac{d}{dx}\left[f^{(n+1)}(\xi)\right] dx \tag{4.10-5}$$

where 

$$q(x) = \int_{a_0}^{x} w(t)\,p_{n+1}(t)\,dt \tag{4.10-6}$$

The part integrated out in (4.10-5) is zero because $q(a_0) = 0$ and $q(a_n)$ is also 
zero. The latter follows because $p_{n+1}(x)$ is an odd function of $x - (a_0 + a_n)/2$ 
(why?), and we have assumed w(x) is an even function of the same argument. 

In the remainder of this section we shall consider the particular case 
w(x) = 1. We leave to a problem {39} the result that in this case q(x) is 
of constant sign in $[a_0, a_n]$. Then since it can be shown that 
$(d/dx)[f^{(n+1)}(\xi)] = [1/(n + 2)]\,f^{(n+2)}(\eta)$, with $\eta$ in the same interval as $\xi$ 
[Ralston (1963)], we get the result that 

$$E = -\frac{f^{(n+2)}(\eta)}{(n + 2)!} \int_{a_0}^{a_n} q(x)\,dx = \frac{f^{(n+2)}(\eta)}{(n + 2)!} \int_{a_0}^{a_n} x\,p_{n+1}(x)\,dx \tag{4.10-7}$$

the latter result following from an integration by parts {39}. Therefore, with n 
even the Newton-Cotes closed formula with w(x) = 1 is exact for polyno- 
mials of degree n + 1 or less. When n is odd, the derivation is somewhat 
more difficult {40}; the result is that 

$$E = \frac{f^{(n+1)}(\eta)}{(n + 1)!} \int_{a_0}^{a_n} p_{n+1}(x)\,dx \tag{4.10-8}$$

which is the result that would be obtained if the second law of the mean 
could in fact be applied to the error term in (4.10-3). 

For the case w(x) = 1 the weights are given in Table 4.5 for values of n 
from 1 to 8 (where n + 1 is the number of abscissas). The value of $H_j$ in 
(4.10-2) is given by $hAW_j$, where h is the spacing of the abscissas. Also given 
are the coefficients of $h^{k+1} f^{(k)}(\eta)$ in the error term, where k = n + 1 if n is odd 
and n + 2 if n is even. Because of the symmetry of the weights, only a portion 
of the complete table need be given. The case n = 1 is the trapezoidal 
formula, and n = 2 is Simpson's rule. 

Table 4.5 Weights and error-term coefficients for Newton-Cotes closed 
formulas 

n   A          $W_0$   $W_1$   $W_2$   $W_3$    $W_4$    Error coefficient 
1   1/2        1       1                                 -1/12 
2   1/3        1       4       1                         -1/90 
3   3/8        1       3       3       1                 -3/80 
4   2/45       7       32      12      32       7        -8/945 
5   5/288      19      75      50      50       75       -275/12,096 
6   1/140      41      216     27      272      27       -9/1400 
7   7/17,280   751     3577    1323    2989     2989     -8183/518,400 
8   4/14,175   989     5888    -928    10,496   -4540    -2368/467,775 

For the same reasons discussed in the previous section, high-order 
Newton-Cotes formulas are seldom used. Note that some of the weights for 
n = 8 are negative. In fact, it can be shown {41} that only for $n \le 7$ and n = 9 
are all the weights positive. The sum of the weights is always the length of the 
interval (why?); therefore if some of the weights are negative, this adversely 
affects roundoff error [cf. (1.4-7)]. 
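
The entries of Table 4.5 can be regenerated from (4.10-4) with w(x) = 1 by integrating each Lagrange polynomial exactly in rational arithmetic. The sketch below takes h = 1 and abscissas 0, 1, ..., n, so it returns $H_j/h = AW_j$; the function name is an illustrative assumption.

```python
from fractions import Fraction


def closed_newton_cotes(n):
    """Exact values H_j / h for the closed (n+1)-point rule with w(x) = 1."""
    weights = []
    for j in range(n + 1):
        coeffs = [Fraction(1)]       # coefficients of prod (x - i), lowest degree first
        denom = Fraction(1)          # prod (j - i) over i != j
        for i in range(n + 1):
            if i == j:
                continue
            coeffs = [Fraction(0)] + coeffs           # multiply the polynomial by x
            for k in range(len(coeffs) - 1):
                coeffs[k] -= i * coeffs[k + 1]        # then subtract i times the old polynomial
            denom *= j - i
        # Integrate the numerator polynomial term by term from 0 to n.
        integral = sum(c * Fraction(n) ** (k + 1) / (k + 1)
                       for k, c in enumerate(coeffs))
        weights.append(integral / denom)
    return weights


simpson_weights = closed_newton_cotes(2)   # [1/3, 4/3, 1/3]
eight_panel = closed_newton_cotes(8)       # contains negative entries
```

Because the arithmetic is exact, the sign pattern for n = 8 and the fact that the weights sum to n (the interval length when h = 1) can be confirmed without roundoff.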

Open Newton-Cotes formulas have the form 

$$\int_{a_0}^{a_n} w(x) f(x)\,dx = \sum_{j=1}^{n-1} H_j f(a_j) + E \tag{4.10-9}$$

In this case we have n - 1 weights. For n odd (4.10-9) is exact for polyno- 
mials of degree n - 2 or less, but for n even, as in the case of the closed 
formulas, we get a bonus in that (4.10-9) is exact for polynomials of degree 
n - 1 or less {42}. The derivation of the error term is similar to that for the 
closed formulas {42}. 

Table 4.6 is the analog of Table 4.5 for open formulas. The error 
coefficients are the coefficients of $h^{k+1} f^{(k)}(\eta)$, where k = n for n even and 
n - 1 for n odd. Note that the sign of the error coefficient in the open rules is 
the opposite of the sign in the closed rules. 

Table 4.6 Weights and error-term coefficients for Newton-Cotes open 
formulas 

n   A      $W_1$   $W_2$   $W_3$   Error coefficient 
2   2      1                       1/3 
3   3/2    1       1               3/4 
4   4/3    2       -1      2       14/45 
5   5/24   11      1       1       95/144 
6   3/10   11      -14     26      41/140 

For n = 2, we have the midpoint rule, which is better known in the form 
it takes when the interval has length h instead of 2h 

$$\int_a^{a+h} f(x)\,dx = h f\left(a + \frac{h}{2}\right) + \frac{h^3}{24}\,f''(\xi) \qquad a < \xi < a + h \tag{4.10-10}$$

The composite midpoint rule is competitive with the composite trapezoidal 
rule both with respect to efficiency and error term. It is preferable to the 
latter when the integrand has singularities at the endpoints of the integration 
interval. Aside from the midpoint rule, there is seldom any advantage in 
using a Newton-Cotes open-type formula in preference to the closed formula 
using the same number of abscissas (see Sec. 4.12), but as we shall see in 
Chap. 5, open-type formulas do have application to the numerical integra- 
tion of ordinary differential equations. The weights in the other open-type 
formulas are all positive only for n = 3 and 5. 

Newton-Cotes formulas with weight functions other than w(x) = 1 are 
sometimes of use also, particularly for singular integrands of the type con- 
sidered in Sec. 4.8-3. We shall not consider such weight functions here but 
refer the reader to the references in the bibliographic notes. 


4.10-1 Composite Newton-Cotes Formulas 


The same reasons for wishing to use Gaussian quadrature formulas in com- 
posite rules apply to Newton-Cotes formulas. Moreover, there is one further 
advantage to composite rules involving closed Newton-Cotes formulas. 
Since the endpoints of each subinterval (except for the endpoints of the 
whole interval) are abscissas for two subintervals, the total number of 
abscissas used is not equal to the number m of subintervals times the 
number n + 1 of points in each subinterval but is m — 1 less than this number. 
This can be best illustrated using n = 1 in Table 4.5. Suppose the interval 
[a, b] is divided into m subintervals of length h and we use (4.10-2) with 
n = 1 on each subinterval. Then our approximation is 

$$\int_a^b f(x)\,dx \approx h\left(\tfrac{1}{2} f_0 + f_1 + f_2 + \cdots + f_{m-2} + f_{m-1} + \tfrac{1}{2} f_m\right) \tag{4.10-11}$$

where $f_j = f(a + hj)$ and h = (b - a)/m. Although there are m subintervals 
and each application of (4.10-2) uses two abscissas, there are only m + 1 
abscissas in (4.10-11). Equation (4.10-11) is, of course, the trapezoidal rule. 
The error is given by 

$$E = -\frac{(b - a)^3}{12m^2}\,f''(\eta) = -\frac{mh^3}{12}\,f''(\eta) = -\frac{(b - a)h^2}{12}\,f''(\eta) \tag{4.10-12}$$

where $\eta$ is in (a, b). As in the case of (4.9-13), the error can be made arbi- 
trarily small by making m sufficiently large [assuming $f''(\eta)$ is bounded in 
(a, b)†]. 

As another example, we divide the interval [a, b] into m/2 intervals of 
length 2h (m even) and use (4.10-2) with n = 2 over each subinterval. Then 
we get 

$$\int_a^b f(x)\,dx \approx \frac{h}{3}\left(f_0 + 4f_1 + 2f_2 + 4f_3 + \cdots + 4f_{m-3} + 2f_{m-2} + 4f_{m-1} + f_m\right) \tag{4.10-13}$$

with h = (b - a)/m. The error is given by 

$$E = -\frac{(b - a)^5}{180m^4}\,f^{iv}(\eta) = -\frac{mh^5}{180}\,f^{iv}(\eta) = -\frac{(b - a)h^4}{180}\,f^{iv}(\eta) \tag{4.10-14}$$

Equation (4.10-13) is the parabolic rule. 
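
The two composite rules translate directly into code. The sketch below implements (4.10-11) and (4.10-13); the function names are illustrative assumptions, and the integrand is supplied as an ordinary Python callable.

```python
def trapezoid(f, a, b, m):
    """Composite trapezoidal rule (4.10-11) with m subintervals."""
    h = (b - a) / m
    interior = sum(f(a + h * j) for j in range(1, m))
    return h * (0.5 * f(a) + interior + 0.5 * f(b))


def simpson(f, a, b, m):
    """Composite parabolic (Simpson) rule (4.10-13); m must be even."""
    if m % 2:
        raise ValueError("m must be even")
    h = (b - a) / m
    odd = sum(f(a + h * j) for j in range(1, m, 2))
    even = sum(f(a + h * j) for j in range(2, m, 2))
    return h / 3.0 * (f(a) + 4.0 * odd + 2.0 * even + f(b))


t4 = trapezoid(lambda x: x * x, 0.0, 1.0, 4)   # 1/3 + 1/96, as (4.10-12) predicts
s4 = simpson(lambda x: x ** 3, 0.0, 2.0, 4)    # exact for cubics
```

For $f(x) = x^2$ the second derivative is constant, so (4.10-12) gives the trapezoidal error exactly: $-(b - a)h^2 f''/12 = -1/96$ with h = 1/4.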

Similar composite rules could be derived using other Newton-Cotes 
closed formulas or using Newton-Cotes open formulas, although in the 
latter case, because the endpoints are not abscissas, we do not get a reduc- 
tion in the total number of abscissas. One advantage of the low-order compo- 
site rules, particularly the trapezoidal rule, is the lack of fluctuation in the 
coefficients which results in good roundoff properties [cf. (1.4-7)]. When an 
integral is approximated using a large value of m in a composite rule, round- 
off error may become a significant factor. Since the expected roundoff error 
will be minimized when the coefficients are most nearly equal, the trapezoi- 
dal rule has almost ideal roundoff properties. Higher-order Newton-Cotes 
formulas in which some of the weights may even be negative have quite bad 
roundoff properties. Note that the Gaussian composite rule (4.9-13) has 
ideal roundoff properties. 

Suppose we use (4.10-11) to estimate $\int_a^b f(x)\,dx$ for two separate
values of m, $m_1$ and $m_2$. Let I be the true value of the integral and $I_1$ and
$I_2$ the respective approximations. Then, using (4.10-12),

$$I - I_1 = -\frac{(b-a)^3}{12m_1^2}f''(\eta_1) \qquad I - I_2 = -\frac{(b-a)^3}{12m_2^2}f''(\eta_2) \tag{4.10-15}$$

where $\eta_1$ and $\eta_2$ are both in (a, b). Now suppose we assume that the two
second derivatives are equal and eliminate these derivatives from (4.10-15),
obtaining

$$I \approx \frac{m_2^2 I_2 - m_1^2 I_1}{m_2^2 - m_1^2} = I_2 + \frac{m_1^2}{m_2^2 - m_1^2}(I_2 - I_1) \tag{4.10-16}$$

† In fact, since the trapezoidal rule is a Riemann sum, it converges as m → ∞ for any
function f(x) continuous on [a, b].

The value of this approximation clearly depends on how good the assumption
is that the two second derivatives are equal. Suppose that we have a
sequence of approximations to I, corresponding to an increasing number of
subintervals, which appear to be converging monotonically. Then this
assumption is probably good since the differences in the errors are probably
being caused mainly by the $m^2$ term in the denominator of (4.10-12).
Equation (4.10-16) is just another example of Richardson extrapolation, discussed
in Sec. 4.2.

From a computational point of view, it is desirable to choose $m_2 = 2m_1$,
for in this case all the abscissas used in the computation with $m_1$ subintervals
are also abscissas for the calculation with $m_2$ subintervals (why?), thus reducing
the number of evaluations of f(x). This consideration will be important in 
the next section. 


Example 4.8 Evaluate the integral of Example 4.3 using (4.10-11) with m = 2 and 4 and 
then use (4.10-16) to obtain a third approximation. 
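When m = 2, we have

$$\int_1^3 \frac{dx}{x} \approx 1\left[\tfrac12(1) + \tfrac12 + \tfrac12\left(\tfrac13\right)\right] = \tfrac76$$

When m = 4,

$$\int_1^3 \frac{dx}{x} \approx \tfrac12\left[\tfrac12(1) + \tfrac23 + \tfrac12 + \tfrac25 + \tfrac12\left(\tfrac13\right)\right] = \tfrac{67}{60}$$

Then from (4.10-16)

$$\int_1^3 \frac{dx}{x} \approx \tfrac{67}{60} + \tfrac{4}{16 - 4}(-.05) = \tfrac{11}{10}$$
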

with the true result In 3 = 1.098612. Note that the Richardson extrapolation gives an 
improved value, but even this value is not so good as that using (4.9-13), where the error 
term has the fourth power of the number of subintervals in the denominator. 
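The arithmetic of Example 4.8 can be checked with a short Python sketch. The helper names `trapezoidal` and `richardson` are hypothetical choices of ours; `richardson` simply evaluates the second form of (4.10-16).

```python
import math


def trapezoidal(f, a, b, m):
    """Composite trapezoidal rule (4.10-11) with m subintervals."""
    h = (b - a) / m
    return h * (0.5 * f(a) + sum(f(a + j * h) for j in range(1, m)) + 0.5 * f(b))


def richardson(i1, m1, i2, m2):
    """Combine trapezoidal results i1 (m1 subintervals) and i2 (m2 subintervals)
    according to (4.10-16), assuming equal second derivatives in (4.10-15)."""
    return i2 + m1 ** 2 / (m2 ** 2 - m1 ** 2) * (i2 - i1)


if __name__ == "__main__":
    f = lambda x: 1.0 / x
    i1 = trapezoidal(f, 1.0, 3.0, 2)      # 7/6
    i2 = trapezoidal(f, 1.0, 3.0, 4)      # 67/60
    i3 = richardson(i1, 2, i2, 4)         # 11/10
    print(i1, i2, i3, math.log(3.0))
```

With $m_2 = 2m_1$ the factor $m_1^2/(m_2^2 - m_1^2)$ is always 1/3, which is the form used in the Romberg scheme of the next section.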


4.10-2 Romberg Integration 


Any computation based on a fixed spacing h is then a candidate for Richardson
extrapolation. To extend the procedure outlined here, we may consider
performing M computations each using a different value of h, that is, a
different number of subintervals in the quadrature context, and then eliminate
the first M − 1 terms in (4.2-5). An intuitive objection to carrying this
procedure too far would be that because of our discussion of high-order
derivatives in Sec. 4.9, we would expect the higher powers of h in the error
term to be accompanied by increasingly large coefficients. Therefore, we
should not be surprised if this procedure did not converge. However, in the
case of the trapezoidal rule, it does converge and leads to the following
elegant and useful method of Romberg.

It is perhaps not obvious that the error in the trapezoidal rule can be 
expressed in the form (4.2-5). But, as we shall derive in Sec. 4.13-1 [cf. 
(4.13-17)], the trapezoidal rule can indeed be written as 


$$\int_a^b f(x)\,dx = h\left(\tfrac12 f_0 + f_1 + f_2 + \cdots + f_{m-1} + \tfrac12 f_m\right) + \sum_{j=1}^{\infty} a_j h^{2j} \tag{4.10-17}$$

with h = (b − a)/m, where the $a_j$'s depend only on a, b, and f(x) [cf. (4.10-12)]
and where the series on the right-hand side of (4.10-17) is asymptotic [cf.
Example 4.11]. Now let

$$I = \int_a^b f(x)\,dx \tag{4.10-18}$$

and let

$$T_{0,k} = \frac{b-a}{2^k}\left(\tfrac12 f_0 + f_1 + f_2 + \cdots + f_{2^k-1} + \tfrac12 f_{2^k}\right) \tag{4.10-19}$$

be the trapezoidal-rule approximation for $2^k$ subintervals. Then

$$T_{0,k} = I - \sum_{j=1}^{\infty} a_j\left(\frac{b-a}{2^k}\right)^{2j} \tag{4.10-20}$$

We define

$$T_{1,k} = \tfrac13\left(4T_{0,k+1} - T_{0,k}\right) \qquad k = 0, 1, \ldots \tag{4.10-21}$$

Using (4.10-20), we have

$$T_{1,k} = I - \sum_{j=2}^{\infty}\tfrac13\left(\frac{4}{4^j} - 1\right)a_j\left(\frac{b-a}{2^k}\right)^{2j} \qquad k = 0, 1, \ldots \tag{4.10-22}$$

Equation (4.10-22) states that if we perform the trapezoidal rule to approximate
I using the spacings $h_1 = (b - a)/2^{k+1}$ and $h_2 = (b - a)/2^k$, the resulting
approximation has a leading term in the error of the order of $h_2^4$. This
equation is therefore directly analogous to (4.2-7). The approximation $T_{1,k}$
is, in fact, precisely the parabolic rule applied to $2^k$ subintervals of length
$2h_1$, i.e., (4.10-13) with $m = 2^{k+1}$ {47}.
Now, in general, we define

$$T_{m,k} = \frac{1}{4^m - 1}\left(4^m T_{m-1,k+1} - T_{m-1,k}\right) \qquad k = 0, 1, \ldots;\ m = 1, 2, \ldots \tag{4.10-23}$$

The approximation $T_{2,k}$ is the composite rule found using the Newton-Cotes
closed formula with n = 4 and $2^k$ subintervals, but for m > 2, there is no
direct relation between $T_{m,k}$ and a Newton-Cotes composite rule {47}.
Using (4.10-23), we can write

$$T_{m,k} = \sum_{j=0}^{m} c_{m,m-j}\,T_{0,k+j} \tag{4.10-24}$$

That is, each $T_{m,k}$ is a linear combination of trapezoidal rules using $2^k$, $2^{k+1}$,
..., $2^{k+m}$ subintervals. Moreover, analogously to (4.10-22), we can show that
the leading term in the error of $T_{m,k}$ as an approximation to I is of the order
of $[(b - a)/2^k]^{2m+2}$ {47}. We would arrange the calculations as indicated in
the following table:

$$\begin{array}{lllll}
T_{0,0} \\
T_{0,1} & T_{1,0} \\
T_{0,2} & T_{1,1} & T_{2,0} \\
\vdots & \vdots & \vdots & \ddots \\
T_{0,m} & T_{1,m-1} & \cdots & \cdots & T_{m,0}
\end{array} \tag{4.10-25}$$

That is, using the m + 1 evaluations of the trapezoidal rule in the first
column, we can then, using (4.10-23), calculate all the remaining entries in
(4.10-25). We know that if f(x) is Riemann-integrable in [a, b], then as
m → ∞, $T_{0,m}$ converges to I. But here we are interested in proving that as
m → ∞, $T_{m,0}$ converges to I. Note that since $T_{m,k}$ has a leading term in its
error of the order of $[(b - a)/2^k]^{2m+2}$ while that of $T_{0,k}$ is $[(b - a)/2^k]^2$, we
would hope that if $T_{m,0}$ converges, it would converge much more rapidly
than $T_{0,m}$.
Using (4.10-24), we can write

$$\begin{bmatrix} T_{0,0} \\ T_{1,0} \\ \vdots \\ T_{m,0} \end{bmatrix} = \begin{bmatrix} 1 \\ c_{11} & c_{10} \\ \vdots & & \ddots \\ c_{mm} & c_{m,m-1} & \cdots & c_{m0} \end{bmatrix} \begin{bmatrix} T_{0,0} \\ T_{0,1} \\ \vdots \\ T_{0,m} \end{bmatrix} \tag{4.10-26}$$

We need first a means for generating the coefficients $c_{mj}$. We leave to a
problem {48} the proof of the result that the $c_{mj}$ are the coefficients of

$$t_m(z) = \sum_{j=0}^{m} c_{mj}z^j = \frac{(4 - z)(4^2 - z)\cdots(4^m - z)}{(4 - 1)(4^2 - 1)\cdots(4^m - 1)} \tag{4.10-27}$$

In particular

$$\sum_{j=0}^{m}|c_{mj}| = t_m(-1) = \prod_{j=1}^{m}\frac{4^j + 1}{4^j - 1} < \prod_{j=1}^{\infty}\frac{4^j + 1}{4^j - 1} \tag{4.10-28}$$

Since this infinite product converges {48}, the sum of magnitudes of the
elements in any row of (4.10-26) is bounded. It also follows from (4.10-27)
that $c_{m,m-j} \to 0$ as m → ∞ {48}; that is, each column in (4.10-26) converges to
zero. Finally, $\sum_{j=0}^{m} c_{mj} = t_m(1) = 1$. The coefficient matrix in (4.10-26)
therefore satisfies the conditions for Toeplitz convergence [see Zygmund (1952)],
which means that since $T_{0,m}$ converges to I, so does $T_{m,0}$. In Example 4.10 in
Sec. 4.12, we shall consider a problem in which $T_{m,0}$ not only converges
but converges much more rapidly than $T_{0,m}$.
Equation (4.10-24) can also be written in terms of ordinates as

$$T_{m,k} = h\sum_{j=0}^{2^{m+k}} d_{mj}\,f_j \qquad h = \frac{b-a}{2^{m+k}} \tag{4.10-29}$$

We leave to a problem {49} the proof of the result that $d_{mj}$ is positive for all
j and m. Therefore, in contrast to the higher-order Newton-Cotes formulas,
the coefficients in Romberg integration do not change sign.

We shall not derive the error term in Romberg integration here, but it 
can be shown [see Bauer, Rutishauser, and Stiefel (1963)] that this error can 
be expressed in the standard form of a constant times $h^{2m+2}$ times the derivative
of order 2m + 2 of f(x) with h as in (4.10-29). The convergence of the sequence
$\{T_{m,0}\}$ implies that this constant becomes small very rapidly (why?).

The Romberg integration scheme is of great theoretical interest and has
also been used in some quadrature routines and even in some adaptive
schemes; it turns out, however, that the number of functional evaluations
needed grows too rapidly with m. In the light of (4.2-10), though, we can
choose a sequence $\{h_i\}$ other than the sequence $\{h/2^i\}$ used above and still
have the benefits of Richardson extrapolation. The formulas for generating
the T table (4.10-25) are only slightly more complicated than (4.10-23). The
particular choice of $\{h_i\} = \{(b - a)/p_i\}$, where $\{p_i\} = \{1, 2, 3, 4, 6, 8, 12, \ldots\}$, has
been recommended by Oliver (1971) as optimal in the sense of giving the
best accuracy with the least amount of roundoff for a fixed amount of
computation. Here, the $p_i$ are all the integers of the form $2^k$ and $3(2^k)$, so that
all evaluations of the integrand are used in the computation of succeeding 
trapezoidal sums. 
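The basic Romberg scheme with the halving sequence, i.e., the recurrence (4.10-23) applied to the table (4.10-25), might be sketched in Python as follows. The function name `romberg_table` and the storage layout (row i holds $T_{0,i}, T_{1,i-1}, \ldots, T_{i,0}$) are our own assumptions, not part of the text; the sketch does not implement the Oliver sequence mentioned above.

```python
import math


def romberg_table(f, a, b, levels):
    """Return the triangular table (4.10-25) as a list of rows, where
    rows[i][m] holds T_{m, i-m}; the first column holds T_{0,i}."""
    h = b - a
    rows = [[0.5 * h * (f(a) + f(b))]]           # T_{0,0}
    n = 1                                        # current number of subintervals
    for i in range(1, levels + 1):
        h /= 2.0
        # Reuse the previous trapezoidal sum; only the new midpoints are evaluated.
        new_points = sum(f(a + (2 * j + 1) * h) for j in range(n))
        row = [0.5 * rows[i - 1][0] + h * new_points]
        n *= 2
        for m in range(1, i + 1):
            # (4.10-23): T_{m,k} = (4^m T_{m-1,k+1} - T_{m-1,k}) / (4^m - 1)
            row.append((4 ** m * row[m - 1] - rows[i - 1][m - 1]) / (4 ** m - 1))
        rows.append(row)
    return rows


if __name__ == "__main__":
    table = romberg_table(lambda x: 1.0 / x, 1.0, 3.0, 5)
    for row in table:
        print("  ".join(f"{t:.6f}" for t in row))
    print("ln 3 =", math.log(3.0))
```

Each new row costs only as many function evaluations as there are new abscissas, which is why doubling the number of subintervals is the natural choice here.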


4.11 ADAPTIVE INTEGRATION 


Whenever we compute an integral numerically by some quadrature formula, 
we cannot know how good the approximation is unless we know something 
about the error. There are various ways to estimate the error. We can try to 
bound the derivative in the error term, but, as we noted earlier, this is usually 
unsatisfactory because, on the one hand, computing the derivatives of a 
complicated function or otherwise estimating them can become very tedious 
and, on the other hand, due to the variation in the higher derivatives, a 
bound on the derivative usually yields an error estimate far in excess of the 
true error, so that we end up evaluating the integrand at many more points 
than are actually needed. Another possibility is to compute a sequence of 
approximations to the integral and stop when two or three of them agree to 
the number of significant figures desired. This sequence may consist of composite
rules of the same underlying primitive rule, Romberg approximations
(as discussed in the previous section), Gauss rules of increasing order, etc.,
with an attempt made to use information generated previously in
subsequent approximations. A variation of this idea is to compute sequences
of pairs of comparable rules using different evaluation points and stop when
there is agreement in one or several successive pairs. In all these schemes
the integral is treated in a global manner; i.e., we always generate approximations
over the original integration interval. Now, while it is relatively
easy to implement these schemes, they may involve a considerable amount
of unnecessary computation. By treating the entire interval in a uniform
manner, they do not distinguish between the places where the integrand is
well-behaved and those where it is not. Bad behavior in a small part of the
interval can cause the numerical approximation over the entire interval to
have a substantial error. A remedy for this situation would be to isolate
those sections of the interval where the function is ill-behaved and work hard
there while approximating the integral where the integrand is well-behaved
with a relatively small amount of computation. The implementation of this
idea leads to an adaptive integration scheme.

Many such schemes exist; we shall describe one in some detail and make 
some comments on others. 

The scheme which we shall describe is an adaptive version of Simpson's
rule which yields an approximation $I_{as}(f)$ to

$$I(f) = I_a^b(f) = \int_a^b f(x)\,dx \tag{4.11-1}$$

such that, it is hoped, either

$$|I(f) - I_{as}(f)| < \epsilon \tag{4.11-2}$$

or

$$|I(f) - I_{as}(f)| < \epsilon\,I(|f|) \tag{4.11-3}$$

for some preassigned tolerance ε > 0. Here we shall use the absolute error
estimate (4.11-2). The estimate (4.11-3) can be treated similarly with
successively better approximations to I(|f|) as more functional values are
computed.

Our approximation $I_{as}(f)$ will have the form

$$I_{as}(f) = \sum_{i=0}^{n-1} R_2[a_i, a_{i+1}]f \tag{4.11-4}$$

where $R_2 f$ is the Simpson five-point rule, i.e., the parabolic rule with m = 4
[cf. (4.10-13)]:

$$R_2[a_i, a_{i+1}]f = \frac{h_i}{12}\left[f(a_i) + 4f\left(a_i + \frac{h_i}{4}\right) + 2f\left(a_i + \frac{h_i}{2}\right) + 4f\left(a_i + \frac{3h_i}{4}\right) + f(a_{i+1})\right] \tag{4.11-5}$$

where $h_i = a_{i+1} - a_i$ and the $a_i$'s are a subdivision of [a, b]:

$$a = a_0 < a_1 < \cdots < a_{n-1} < a_n = b \tag{4.11-6}$$

The subintervals $[a_i, a_{i+1}]$ are, in general, of different length, but each has
length $H/2^r$ for some r, where H = b − a. A subinterval of length $H/2^r$ is said
to have level r. The various subintervals are determined by taking the level 0
interval [a, b], subdividing it into two level 1 subintervals [a, a + H/2],
[a + H/2, b], then perhaps subdividing either or both of these into two level
2 subintervals and continuing this way until (4.11-6) is obtained. Whether or
not a given level r subinterval is divided into two level r + 1 subintervals is
determined by a comparison of the result of (4.11-5) on this subinterval with
the result of Simpson's rule:

$$R_1[a_i, a_{i+1}]f = \frac{h_i}{6}\left[f(a_i) + 4f\left(a_i + \frac{h_i}{2}\right) + f(a_{i+1})\right] \tag{4.11-7}$$

as we shall now describe.

At a particular stage in the computation, suppose we have a subinterval
$[a_i, a_i + h]$ at level r, so that the numbers $a_0, a_1, \ldots, a_i$ have already been
determined and the approximation

$$\sum_{j=1}^{i} R_2[a_{j-1}, a_j]f \approx \int_a^{a_i} f(x)\,dx \tag{4.11-8}$$

has been accepted; i.e., the estimated error is less than $(a_i - a)\epsilon/H$. The rest
of the values of the final subdivision are not yet available [although a subset
of these numbers together with the corresponding function values is available
and can be used, e.g., to estimate I(|f|)]. Now $R_1[a_i, a_i + h]f$ is usually
available from a calculation at a lower level, together with the values $f(a_i)$,
$f(a_i + h/2)$, and $f(a_i + h)$. To get $R_2[a_i, a_i + h]f$ and compare it with
$R_1[a_i, a_i + h]f$ we thus need to compute only two new values $f(a_i + h/4)$
and $f(a_i + 3h/4)$. We then check to see if

$$\left|R_2[a_i, a_i + h]f - R_1[a_i, a_i + h]f\right| < \epsilon(r) \tag{4.11-9}$$

where a choice for ε(r), described below, is given by (4.11-15). If so, we say
that this interval has converged, we add $a_{i+1} = a_i + h$ to the list of accepted
points, we add $R_2[a_i, a_{i+1}]f$ to the left-hand side of (4.11-8) to get an
approximation to $I_a^{a_{i+1}}(f)$, and we go on to the next subinterval to the right of $a_{i+1}$.

If (4.11-9) does not hold, we subdivide $[a_i, a_{i+1}]$ into two subintervals
$[a_i, a_i + h/2]$, $[a_i + h/2, a_i + h]$, on level r + 1, and start working on the
interval $[a_i, a_i + h/2]$, saving all previously computed values for future use.
Thus we proceed from left to right and store for subsequent attention various
intervals to the right of $[a_i, a_i + h]$ at various levels between 1 and r. We
finish when a subinterval $[a_{n-1}, b]$ converges.

Now we consider the choice of ε(r). If we can choose ε(r) such that

$$\left|I_{a_i}^{a_{i+1}}(f) - R_2[a_i, a_{i+1}]f\right| < \frac{\epsilon}{2^r} \tag{4.11-10}$$

where the subinterval $[a_i, a_{i+1}]$ is at level r, then it is easy to show {54} that
$I_{as}(f)$ as given by (4.11-4) satisfies (4.11-2). To determine an appropriate ε(r),
we consider the error in Simpson's rule on [α, α + h]:

$$I_\alpha^{\alpha+h}(f) - R_1[\alpha, \alpha+h]f = -\frac{(h/2)^5}{90}f^{iv}(\xi_1) = -\frac{(h/2)^5}{90}f^{iv}\left(\alpha + \frac{h}{2}\right) + O(h^6) \tag{4.11-11}$$

provided f(x) has a continuous fifth derivative on [α, α + h] and where
$\alpha < \xi_1 < \alpha + h$. The error using the five-point rule is

$$I_\alpha^{\alpha+h}(f) - R_2[\alpha, \alpha+h]f = -\frac{(h/2)^5}{1440}f^{iv}(\xi_2) = -\frac{(h/2)^5}{1440}f^{iv}\left(\alpha + \frac{h}{2}\right) + O(h^6) \qquad \alpha < \xi_2 < \alpha + h \tag{4.11-12}$$

If we neglect the terms in $h^6$, then

$$R_2[\alpha, \alpha+h]f - R_1[\alpha, \alpha+h]f \approx \left(\frac{(h/2)^5}{1440} - \frac{(h/2)^5}{90}\right)f^{iv}\left(\alpha + \frac{h}{2}\right) = -\frac{15(h/2)^5}{1440}f^{iv}\left(\alpha + \frac{h}{2}\right) \tag{4.11-13}$$

so that

$$|I - R_2| \approx \tfrac{1}{15}|R_2 - R_1| \tag{4.11-14}$$

Therefore, if $|R_2 - R_1| < 15\epsilon/2^r$, then $|I - R_2| < \epsilon/2^r$ and so we choose

$$\epsilon(r) = \frac{15\epsilon}{2^r} \tag{4.11-15}$$

The neglect of the $O(h^6)$ terms is compensated for by the fact that we are
working with magnitudes of errors [cf. (4.11-10)], whereas the errors on the
various subintervals are usually of both signs. Therefore there tends to be
some cancellation so that the actual error neglecting $O(h^6)$ terms is usually
much less than the tolerance ε. For further refinements of this particular
scheme, see the references in the Bibliographic Notes. 
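The scheme just described can be sketched in Python as follows. This is a minimal illustration, not a production routine: the name `adaptive_simpson`, the recursive formulation (which visits subintervals from left to right, as in the text, but uses the call stack in place of an explicit list of pending intervals), and the `max_level` safeguard are all assumptions of ours.

```python
import math


def adaptive_simpson(f, a, b, eps, max_level=30):
    """Adaptive Simpson integration using R_1 (4.11-7), R_2 (4.11-5) and the
    acceptance test (4.11-9) with eps(r) = 15*eps/2**r from (4.11-15).
    Returns (estimate, number of function evaluations)."""
    count = 0

    def fe(x):
        nonlocal count
        count += 1
        return f(x)

    def refine(x0, h, f0, f2, f4, r):
        # f0, f2, f4: known values at the left end, midpoint, and right end.
        f1 = fe(x0 + h / 4.0)           # the only two new evaluations
        f3 = fe(x0 + 3.0 * h / 4.0)
        r1 = h / 6.0 * (f0 + 4.0 * f2 + f4)
        r2 = h / 12.0 * (f0 + 4.0 * f1 + 2.0 * f2 + 4.0 * f3 + f4)
        # max_level is a hypothetical safeguard against endless subdivision.
        if abs(r2 - r1) < 15.0 * eps / 2 ** r or r >= max_level:
            return r2
        half = h / 2.0
        left = refine(x0, half, f0, f1, f2, r + 1)          # left half first
        right = refine(x0 + half, half, f2, f3, f4, r + 1)
        return left + right

    fa, fm, fb = fe(a), fe(0.5 * (a + b)), fe(b)
    return refine(a, b - a, fa, fm, fb, 0), count


if __name__ == "__main__":
    value, n_evals = adaptive_simpson(math.exp, 0.0, 1.0, 1e-8)
    print(value, math.e - 1.0, n_evals)
```

Every function value is computed once and passed down to the two half intervals, so a subdivision costs only the two new ordinates per subinterval noted above.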

If we analyze this scheme, we see that its components are a pair of 
quadrature rules with error estimates containing the same power of h and a 
way of choosing which subinterval to tackle next, both in case of success, i.e., 
convergence, and failure, i.e., no convergence. In the various adaptive quad- 
rature schemes in existence, other pairs of rules or, more generally, se- 
quences of rules are used, other methods of choosing the next subinterval are 
considered, and subdivisions other than bisection are proposed. For exam- 
ple, another choice for the next subinterval is that one on which the 
estimated error is maximal. It has been estimated that millions of reasonable 
adaptive-integration schemes can be constructed, each one having some 
claim to superior performance. However, any one of the routines currently
available should do the trick for most of the problems arising in practice. 

Of course, the scheme described in this section is not foolproof. Thus, if 
one knows how the routine works, one can cook up a function which will 
“beat” the routine by having it vanish at a sufficient number of appropriate 
points. This in turn can be countered by making a random division of the 
initial interval into two subintervals and applying the adaptive scheme 
separately on each of these subintervals. In general, as in so many problems 
in numerical analysis, there is a tradeoff between efficiency and robustness, 
where the latter is a quality of an algorithm which in almost every case either 
gives the correct answer within the accuracy desired or exits with an 
indication that it cannot achieve this accuracy. Only in very rare cases, 
therefore, will a robust algorithm give an incorrect answer and claim it to be 
correct. For one-shot problems, robustness is almost always more important 
than efficiency. But when a series of integrals is to be computed, efficiency 
becomes important. The user must then decide whether a saving of time is 
worth the risk of possibly incorrect results. If the integrands are well-behaved, 
this risk will be very slight. Only in the case of difficult integrands does it 
become difficult to decide. 


Example 4.9 Use adaptive integration to calculate

$$\int_0^{10} e^x\,dx = e^{10} - 1 = 22{,}025.46579$$

with an error of less than .001.
Using the error criterion (4.11-15) and using $R_2$ and $R_1$ as defined in (4.11-5) and
(4.11-7), we require 47 abscissas to calculate the integral. These are


0 to 34 at a spacing of 3 

34 to 64 at a spacing of 7% 
64 to 843 at a spacing of 3 
822 to 10 at a spacing of & 


The calculated integral is 22,025.46607 


which differs from the desired value by .00028. If, using (4.11-14), the error is estimated at
each stage and these errors are accumulated, the estimated error in the result is −.00028


which, added to the calculated result, would give full accuracy to the 10 significant digits 
carried in the calculation. 


4.12 CHOOSING A QUADRATURE FORMULA 


If we have no more than a few integrals to compute and the integrand can be 
evaluated at any point, the use of an adaptive-quadrature routine is in- 
dicated and any fully tested one will do. If the above situation does not hold, 
we must make some decisions about choosing a quadrature rule. 
Table 4.7 Error-term coefficients in Newton-Cotes closed and open formulas

                              Error coefficient
Number of abscissas     Closed              Open
        2               −1/12               1/36
        3               −1/2880             7/23,040
        4               −1/6480             19/90,000
        5               −1/1,935,360        41/39,191,040


Consider first the problem of tabulating the integral of a tabulated 
function. Because the data are available at equal intervals, we would almost 
certainly choose a Newton-Cotes formula in order to avoid interpolation. If 
derivatives of the tabulated function can be evaluated, the Euler-Maclaurin 
formula of the following section can be used with good results. It gives very 
good accuracy with only a small additional computational effort, namely the 
computation of derivatives of odd order at the endpoints of the integration 
interval. If the derivatives can only be bounded, and if the derivatives do not 
grow too rapidly, we can choose n, the order of the Newton-Cotes formula, 
so that the error bound assures us of the accuracy we desire. 

We can, of course, use open or closed Newton-Cotes formulas. Tables 
4.5 and 4.6 might lead us to believe that, for the same number of abscissas, 
the closed-type formula is clearly superior. Consider the case where the 
number of abscissas is 3. Using n = 2 in Table 4.5, the error is $-\tfrac{1}{90}h^5 f^{iv}(\eta)$
and using n = 4 in Table 4.6, the error is $\tfrac{14}{45}h^5 f^{iv}(\eta)$. But note that h is
(b − a)/2 in the first case and h is (b − a)/4 in the second. So, in fact, the
errors are, respectively, $-[(b - a)^5/2880]f^{iv}(\eta)$ and $[7(b - a)^5/23{,}040]f^{iv}(\eta)$,
and the open formula is slightly better. In Table 4.7 the coefficients of the
corresponding error terms, with h replaced by (b — a)/n, are given for n 
equals 2 to 5. For 2 and 3, the open formula is slightly better, but for 4 and 5 
the closed formula is better; for n > 5 the advantage of the closed formula 
becomes more marked. 

If we cannot bound the derivatives, then no matter what order Newton- 
Cotes formula we use, we have no assurance that we shall achieve the desired 
accuracy. Here the safest technique is to use a composite formula and in- 
crease the number of subintervals until the number of decimal places of 
accuracy that we desire has stabilized, i.e., remains unchanged, in two suc- 
cessive approximations. 


Example 4.10 Use the trapezoidal rule to evaluate the integral of Example 4.3 with an
error of less than ½ in the fourth decimal place.
In Example 4.8 we used the trapezoidal rule with m = 2 and 4 and got the results

m = 2:      1.166667
m = 4:      1.116667


In order to use those abscissas at which f(x) has already been calculated, we continue to 
double the number of subintervals. The results are 


m= 8: 1.103211 
m= 16: 1.099768 
m= 32: 1.098902 
m= 64: 1.098685 
m= 128: 1.098630 


At this point the third decimal place has surely stabilized and the fourth appears to be in
error by no more than ½ unit. Applying (4.10-16) to the results for m = 64 and m = 128, we
get I = 1.098612 which is, in fact, accurate to six decimal places. Applying Romberg
integration to this problem, we get for the table (4.10-25)

1.333333
1.166667    1.111111
1.116667    1.100000    1.099259
1.103211    1.098726    1.098641    1.098631
1.099768    1.098620    1.098613    1.098613
1.098902    1.098613    1.098613    1.098613
1.098685    1.098613    1.098613    1.098613
1.098630    1.098612    1.098612    1.098612


so that, except for round-off error, convergence is achieved after the use of only 16 
subintervals (fifth line of table). 


Of course, higher-order composite rules than the trapezoidal rule can 
also be used. The parabolic rule, for example, ultimately converges more 
rapidly than the trapezoidal rule because of the $1/m^4$ term in (4.10-14) in
contrast to the $1/m^2$ term in (4.10-12). Because of the high-order derivative in
the error term, higher-order composite rules, i.e., composite rules derived 
from higher-order Newton-Cotes formulas, are seldom used. Indeed our 
general conclusion is that the simplicity and rapid convergence of Romberg 
integration make the trapezoidal rule the best Newton-Cotes formula to 
use for most problems. 

It is important to realize that because of the results of our discussion in 
Sec. 4.9 on the growth of higher-order derivatives, it is not true in general 
that any desired degree of accuracy can be achieved by an arbitrarily high- 
order formula. To illustrate this, following Hildebrand (1974), let us consider

$$\int_{-4}^{4}\frac{dx}{1 + x^2} = 2\tan^{-1}4 = 2.6516353$$

In Table 4.8 the results of evaluating this integral using varying numbers of 
abscissas and various closed Newton-Cotes formulas are shown. It is rea- 
sonably clear that the Newton-Cotes formulas of high order are not converg- 
ing. In contrast, the parabolic rule is oscillating about the true solution with 
Table 4.8 Evaluation of $\int_{-4}^{4} dx/(1 + x^2)$ by Newton-Cotes techniques

Number of
abscissas    Newton-Cotes          Parabolic     Trapezoidal    Romberg
    n        formula of order n    rule          rule           integration
    3        5.490                 5.490         4.235          5.490
    5        2.278                 2.478         2.918          2.278
    7        3.329                 2.908         2.701
    9        1.941                 2.573         2.659          2.584
   11        3.596                 2.695         2.6511
   17                              2.6477        2.6505         2.6542
   33                              2.651627      2.65135        2.65186
   65                              2.6516353     2.65156        2.651631
  129                              2.6516353     2.651617       2.6516353


decreasing amplitude, and the trapezoidal rule is converging monotonically 
to the true result. Both the parabolic and trapezoidal rules must converge to 
the true solution with increasing n, and eventually, the parabolic rule must, 
as it does, give better results than the trapezoidal rule (why?). For purposes 
of comparison, Table 4.8 also contains the results using Romberg integra- 
tion. For 17 or more abscissas, Romberg gives better results than the 
trapezoidal rule, and for 129 or more abscissas, it would give better results 
than the parabolic rule except for roundoff. 

The singularity of the integrand at x = ±i, which limits the radius of
convergence of the Taylor-series expansion, is the cause of the difficulty with 
the higher-order Newton-Cotes formulas. But the reader will agree that the 
example chosen is not so contrived that we would expect occurrence of this 
phenomenon to be rare. 

If the function whose integral we wish to calculate is not tabulated but is 
given analytically and we must calculate many integrals of the same type, 
i.e., over the same interval and with similar integrands, then we must con-
sider Gaussian as well as Newton-Cotes formulas for use on an automatic 
computer. If a single quadrature formula—perhaps of high order—is to be 
used, the case for using some Gaussian formula is very good. The reason 
for this is that the Gaussian formulas achieve higher accuracy with the use 
of fewer abscissas than Newton-Cotes methods. This is illustrated in Table 
4.9 for the Gauss-Legendre formula. The integral in all cases is considered 
to be over the interval [—1, 1]. The order of accuracy n is the highest- 
degree polynomial integrated exactly by the method, and the error coefficient 
is the coefficient of $f^{(n+1)}(\eta)$ in the error term. Thus not only do the
Gaussian formulas require calculation of fewer values of f(x) but also they 
have slightly more favorable error terms. In this case then, the general 
superiority of the Gaussian formulas is clear.

Table 4.9 Comparison of accuracy of Gaussian and Newton-Cotes (closed) formulas

Order of       Gauss-Legendre formula           Newton-Cotes formula
accuracy     Number of      Error             Number of      Error
    n        abscissas      coefficient       abscissas      coefficient
    3            2          1/135                 3          −1/90
    5            3          1/15,750              5          −1/15,120
    7            4          1/3,472,875           7          −1/3,061,800
    9            5          8.08 × 10^−10         9          −1.21 × 10^−9



Furthermore, in contrast to the Newton-Cotes case, if f(x) is continuous 
and w(x) is any integrable, nonnegative weight function, then any desired 
degree of accuracy is obtainable by using a sufficiently high-order Gaussian 
quadrature formula. To see this let ε > 0 be given and let p(x) be a polynomial,
say of degree N, such that |f(x) − p(x)| < ε on [a, b]. The existence of
such a polynomial is guaranteed by the Weierstrass approximation theorem
(page 36). Now, using the notation of (4.5-1),

$$|E| = \left|\int_a^b w(x)f(x)\,dx - \sum_{j=1}^{n} H_j f(a_j)\right| \le \left|\int_a^b w(x)[f(x) - p(x)]\,dx\right| + \left|\int_a^b w(x)p(x)\,dx - \sum_{j=1}^{n} H_j p(a_j)\right| + \left|\sum_{j=1}^{n} H_j[p(a_j) - f(a_j)]\right| \tag{4.12-1}$$

If 2n − 1 ≥ N, the second term on the right is zero (why?). Then, since the
sum of the weights in any Gaussian quadrature formula is the integral of the
weight function over the interval [to see this let f(x) = 1 in (4.5-1)], and since
all the Gaussian weights are positive, we have

$$|E| \le \epsilon\int_a^b w(x)\,dx + \epsilon\int_a^b w(x)\,dx = 2\epsilon\int_a^b w(x)\,dx \tag{4.12-2}$$

which proves the assertion that the error can be made arbitrarily small. 
Therefore, using a sequence of Gaussian formulas with increasing n will lead 
to a convergent sequence of approximations. However, because it is incon- 
venient to have to store a sequence of Gaussian weights and abscissas in a 
computer,† because it is difficult to estimate the magnitude of the error term
for all but quite small n, and because of the efficiency of using composite
rules, this approach is less used in practice than it should be. Note that this
proof of convergence is valid for any sequence of quadrature formulas of
increasing order provided only that the weights are all positive, a condition
which is not satisfied by Newton-Cotes formulas. But this condition is
satisfied by Romberg integration [cf. (4.10-29)], and therefore the above
argument also provides a proof of the convergence of the Romberg technique.

† It is also possible to generate them internally at a cost of O(n²) operations.

Thus, when there are many similar integrals to calculate, our strategy 
would be to do first an exploratory computation on a few integrals to 
determine the order of the Gaussian formula which gives the desired accur- 
acy and then to use this formula for the entire series of integrals. If one wants 
to check the accuracy, one can do so, for example, with a second Gaussian 
formula of higher order. 
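The exploratory strategy can be illustrated with the following Python sketch for the Gauss-Legendre case w(x) = 1. Everything here is an assumption made for illustration: the names `gauss_legendre_rule`, `gauss_legendre`, and `choose_order` are ours, the abscissas are generated internally by Newton iteration on the Legendre polynomial $P_n$ (one common way of doing so) rather than stored, and the stopping test simply compares formulas of successive orders.

```python
import math


def gauss_legendre_rule(n):
    """Abscissas and weights of the n-point Gauss-Legendre formula on [-1, 1],
    found by Newton iteration on the Legendre polynomial P_n."""
    nodes, weights = [], []
    for i in range(1, n + 1):
        x = math.cos(math.pi * (i - 0.25) / (n + 0.5))     # starting guess
        for _ in range(100):
            p_prev, p = 1.0, x                             # P_0 and P_1
            for k in range(2, n + 1):                      # three-term recurrence
                p_prev, p = p, ((2 * k - 1) * x * p - (k - 1) * p_prev) / k
            dp = n * (x * p - p_prev) / (x * x - 1.0)      # P_n'(x)
            dx = p / dp
            x -= dx
            if abs(dx) < 1e-14:
                break
        nodes.append(x)
        weights.append(2.0 / ((1.0 - x * x) * dp * dp))
    return nodes, weights


def gauss_legendre(f, a, b, n):
    """n-point Gauss-Legendre approximation to the integral of f over [a, b]."""
    nodes, weights = gauss_legendre_rule(n)
    half, mid = 0.5 * (b - a), 0.5 * (b + a)
    return half * sum(w * f(mid + half * x) for x, w in zip(nodes, weights))


def choose_order(f, a, b, tol, n_max=40):
    """Exploratory computation: smallest n whose result agrees with that of
    order n - 1 to within tol (a heuristic check, not an error bound)."""
    previous = gauss_legendre(f, a, b, 1)
    for n in range(2, n_max + 1):
        current = gauss_legendre(f, a, b, n)
        if abs(current - previous) < tol:
            return n, current
        previous = current
    raise RuntimeError("no agreement reached up to n_max")


if __name__ == "__main__":
    n, value = choose_order(lambda x: 1.0 / x, 1.0, 3.0, 1e-10)
    print(n, value, math.log(3.0))
```

Once `choose_order` has been run on a few representative integrals, the order it returns would be fixed and `gauss_legendre` used for the rest of the series.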

We should also mention the possibility of transforming the integrand by 
a change of the independent variable. This can sometimes be used to eliminate
the singularity in an improper integral. For example, if f(x) is continuous
in [0, 1], the change of variable $t^n = x$ transforms the integral

$$\int_0^1 x^{-1/n}f(x)\,dx \qquad n \ge 2 \tag{4.12-3}$$

into

$$n\int_0^1 f(t^n)\,t^{n-2}\,dt \tag{4.12-4}$$

which is a proper integral. A more generally applicable transformation is

$$\int_{g(a)}^{g(b)} f(x)\,dx = \int_a^b f(g(t))\,g'(t)\,dt \tag{4.12-5}$$


where g(t) is a continuously differentiable function whose derivative does 
not vanish on [a, b]. By a proper choice of the function g(t) it may be 
possible to evaluate the transformed integral more efficiently than the orig- 
inal one. 
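As an illustration of removing an endpoint singularity of the type $x^{-1/n}$ by the substitution $t^n = x$, the following Python sketch applies the composite parabolic rule to the transformed integrand $n f(t^n) t^{n-2}$. The names `parabolic` and `singular_integral`, the default of 64 subintervals, and the choice of the parabolic rule are assumptions made only for this example.

```python
import math


def parabolic(g, a, b, m):
    """Composite parabolic rule (4.10-13) with m (even) subintervals."""
    h = (b - a) / m
    odd = sum(g(a + j * h) for j in range(1, m, 2))
    even = sum(g(a + j * h) for j in range(2, m, 2))
    return (h / 3.0) * (g(a) + 4.0 * odd + 2.0 * even + g(b))


def singular_integral(f, n, m=64):
    """Approximate the integral of x**(-1/n) * f(x) over [0, 1], n >= 2,
    after the change of variable t**n = x, which gives n * f(t**n) * t**(n-2)."""
    transformed = lambda t: f(t ** n) * t ** (n - 2)
    return n * parabolic(transformed, 0.0, 1.0, m)


if __name__ == "__main__":
    # Reference value from term-by-term integration of the exponential series:
    # the integral of x**(k - 1/2) over [0, 1] is 1/(k + 1/2).
    reference = sum(1.0 / (math.factorial(k) * (k + 0.5)) for k in range(30))
    print(singular_integral(math.exp, 2), reference)
```

The original integrand is unbounded at x = 0, so a closed rule could not even be applied to it directly, whereas the transformed integrand is smooth on [0, 1].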

In the present chapter we have covered only the basic topics in numeri- 
cal quadrature. Thus we have not mentioned the difficult problem of integra- 
tion of highly oscillatory integrands and have only briefly touched upon 
infinite intervals and integrands with singularities. Nor have we mentioned 
multiple integrals. Finally we have not gone deeply into the important 
question of error analysis. Let us only point out that the Peano theorem 
(Sec. 2.3) has proved very useful for estimating the errors of quadrature 
formulas. Details of the above topics and of others can be found in the 
literature cited in the Bibliographic Notes. 
4.13 SUMMATION


Our interest in this section is in approximating the sum

$$\sum_{j=0}^{n} f(x_0 + jh) \tag{4.13-1}$$



where n may be infinite. As a by-product of the development of methods to 
do this, we shall arrive at another and useful technique for numerical 
quadrature. 


4.13-1 The Euler-Maclaurin Sum Formula 


We begin by introducing the Bernoulli polynomials $B_k(x)$, defined to be the
polynomials of degree k which are the coefficients of $t^k/k!$ in the expansion

$$\frac{t(e^{xt} - 1)}{e^t - 1} = \sum_{k=0}^{\infty} B_k(x)\frac{t^k}{k!} \tag{4.13-2}$$

from which it follows that

$$B_k(0) = 0 \quad k \ge 0 \qquad\qquad B_k(1) = 0 \quad k \ne 1 \tag{4.13-3}$$

By expanding the left-hand side of (4.13-2) in a power series we can compute
the first few Bernoulli polynomials as

$$B_0(x) = 0 \qquad B_1(x) = x \qquad B_2(x) = x^2 - x \qquad B_3(x) = x^3 - \frac{3x^2}{2} + \frac{x}{2} \tag{4.13-4}$$

The expansion of the left-hand side of (4.13-2) with the $e^{xt} - 1$ term deleted
can be written

$$\frac{t}{e^t - 1} = \sum_{k=0}^{\infty} B_k\frac{t^k}{k!} \tag{4.13-5}$$

where the constants $B_k$ are the Bernoulli numbers.† Of the many identities
involving Bernoulli polynomials and numbers, the following three are of
particular interest to us:

$$B_{2k+1} = 0 \qquad k > 0 \tag{4.13-6}$$

$$B_{2k}'(x) = 2kB_{2k-1}(x) \qquad k > 1 \tag{4.13-7}$$

$$B_{2k+1}'(x) = (2k + 1)[B_{2k}(x) + B_{2k}] \qquad k \ge 0 \tag{4.13-8}$$


† The Bernoulli polynomials and numbers are defined somewhat differently by different
authors. Our notation follows generally that of Bromwich (1947). See Steffensen (1950) for a 
somewhat different notation. These polynomials should not be confused with the Bernstein 
polynomials (p. 36). 
We leave the derivations to a problem {57}. The first few nonzero Bernoulli
numbers are $B_0 = 1$, $B_1 = -\frac{1}{2}$, $B_2 = \frac{1}{6}$, $B_4 = -\frac{1}{30}$, $B_6 = \frac{1}{42}$, $B_8 = -\frac{1}{30}$,
$B_{10} = \frac{5}{66}$, $B_{12} = -\frac{691}{2730}$, $B_{14} = \frac{7}{6}$, $B_{16} = -\frac{3617}{510}$, $B_{18} = 43{,}867/798$,
$B_{20} = -174{,}611/330$. Now we define

$$X_k = \frac{1}{(2k)!}\int_0^h B_{2k}\!\left(\frac{y}{h}\right) f^{(2k)}(x + y)\,dy \tag{4.13-9}$$


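where $f(x)$ is the function of (4.13-1). Integrating by parts twice and using
(4.13-3), (4.13-7), and (4.13-8), we get

$$X_k = \frac{1}{h^2}X_{k-1} + \frac{B_{2k-2}}{h^2\,(2k-2)!}\left[f^{(2k-3)}(x+h) - f^{(2k-3)}(x)\right] \qquad k > 1 \tag{4.13-10}$$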

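When $k = 1$ we have, also by integration by parts,

$$\begin{aligned}
X_1 &= \frac{1}{2}\int_0^h B_2\!\left(\frac{y}{h}\right) f''(x+y)\,dy = \frac{1}{2}\int_0^h \left(\frac{y^2}{h^2} - \frac{y}{h}\right) f''(x+y)\,dy \\
&= -\frac{1}{h}\int_0^h \left(\frac{y}{h} - \frac{1}{2}\right) f'(x+y)\,dy \\
&= -\frac{1}{2h}[f(x+h) + f(x)] + \frac{1}{h^2}\int_0^h f(x+y)\,dy \\
&= -\frac{1}{2h}[f(x+h) + f(x)] + \frac{1}{h^2}\int_x^{x+h} f(y)\,dy
\end{aligned} \tag{4.13-11}$$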


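Using (4.13-10) and (4.13-11), we have

$$\begin{aligned}
\tfrac{1}{2}[f(x+h) + f(x)] &= \frac{1}{h}\int_x^{x+h} f(y)\,dy - hX_1 \\
&= \frac{1}{h}\int_x^{x+h} f(y)\,dy + \frac{B_2 h}{2!}[f'(x+h) - f'(x)] - h^3 X_2 \\
&= \frac{1}{h}\int_x^{x+h} f(y)\,dy + \frac{B_2 h}{2!}[f'(x+h) - f'(x)] + \frac{B_4 h^3}{4!}[f'''(x+h) - f'''(x)] - h^5 X_3 \\
&= \frac{1}{h}\int_x^{x+h} f(y)\,dy + \sum_{k=1}^{m} \frac{B_{2k} h^{2k-1}}{(2k)!}\left[f^{(2k-1)}(x+h) - f^{(2k-1)}(x)\right] - h^{2m+1} X_{m+1}
\end{aligned} \tag{4.13-12}$$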
We define

$$\bar{B}_k(x) = \begin{cases} B_k(x) & 0 \le x < 1 \\ \bar{B}_k(x - 1) & \text{otherwise} \end{cases} \tag{4.13-13}$$


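Now consider (4.13-9) with the limits going from $jh$ to $(j + 1)h$ and with
$B_{2k}(y/h)$ replaced by $\bar{B}_{2k}(y/h)$. We can repeat the derivation of (4.13-12) for
$j = 1, \ldots, n - 1$. Then summing the results, replacing $x$ by $x_0$, and noting
that the series on the right-hand side of (4.13-12) telescopes, we get

$$\sum_{j=0}^{n} f(x_0 + jh) = \frac{1}{h}\int_{x_0}^{x_0+nh} f(y)\,dy + \frac{1}{2}[f(x_0 + nh) + f(x_0)] + \sum_{k=1}^{m} \frac{B_{2k} h^{2k-1}}{(2k)!}\left[f^{(2k-1)}(x_0 + nh) - f^{(2k-1)}(x_0)\right] + E_m \tag{4.13-14}$$

where, using (4.13-9),

$$E_m = -\frac{h^{2m+1}}{(2m+2)!}\int_0^{nh} \bar{B}_{2m+2}\!\left(\frac{y}{h}\right) f^{(2m+2)}(x_0 + y)\,dy \tag{4.13-15}$$
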
By using the properties of the Bernoulli polynomials {59}, (4.13-15) can be
simplified to

$$E_m = \frac{n h^{2m+2} B_{2m+2}}{(2m+2)!}\,f^{(2m+2)}(\xi) \tag{4.13-16}$$

where $x_0 < \xi < x_0 + nh$. Equation (4.13-14) is the Euler-Maclaurin sum
formula. It is useful even when the upper limit in the summation is infinite,
although in this case the error can no longer be expressed in the form
(4.13-16). A useful error estimate in the infinite case (and when $n$ is finite) is
that the error is less than the magnitude of the first neglected term in the
summation on the right-hand side of (4.13-14) if $f^{(2m+2)}(x)$ and $f^{(2m+4)}(x)$ do
not change sign and are of the same sign for $x_0 < x < x_0 + nh$ {60}. If just
$f^{(2m+2)}(x)$ does not change sign in the interval, the error is less than twice the
first neglected term {61}. 


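Example 4.11 Use (4.13-14) to approximate

$$\sum_{j=1}^{\infty} \frac{1}{j^2}$$

We use $f(x) = 1/x^2$ with $x_0 = h = 1$. The true sum is $\pi^2/6 \approx 1.644934067$. Using
(4.13-14), our approximation for the sum is

$$\sum_{j=1}^{\infty} \frac{1}{j^2} \approx 1 + \frac{1}{2} + \sum_{k=1}^{m} B_{2k}$$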
The following table indicates the error between the true result and the approximation as a
function of m:

 m     Error      m      Error

 0      .145      6       .198
 1     -.022      7      -.968
 2      .012      8      6.124
 3     -.012      9    -48.847
 4      .021     10    480.277
 5     -.055

The approximations, after initially converging, are clearly diverging. Suppose now we try
to approximate

$$\sum_{j=0}^{\infty} \frac{1}{(10 + j)^2}$$

Then we have from (4.13-14), now with $x_0 = 10$,

$$\sum_{j=0}^{\infty} \frac{1}{(10 + j)^2} \approx \frac{1}{10} + \frac{1}{2}\,\frac{1}{10^2} + \sum_{k=1}^{m} \frac{B_{2k}}{10^{2k+1}}$$

With $m = 2$ we get

$$\frac{1}{10} + \frac{1}{200} + \frac{1}{6(10)^3} - \frac{1}{30(10)^5} = .1051663333$$

Added to the directly calculated sum

$$\sum_{j=1}^{9} \frac{1}{j^2} = 1.5397677311,$$

this gives 1.6449340644, which is correct to almost nine decimals, a remarkable change
from the previous case. To account for this, we note that all derivatives of $f(x)$ are of
constant sign in $(0, \infty)$, so that the error is less than the first neglected term. In the first
case this is $B_{2m+2}$, and in the second it is $B_{2m+2}/10^{2m+3}$. Now it is known [see Steffensen
(1950)] that ultimately the Bernoulli numbers $B_{2i}$ grow as $(2i)!$, so that
$B_{2m+2}/x_0^{2m+3} \to \infty$ as $m \to \infty$ for any $x_0$. Thus the Euler-Maclaurin sum formula diverges
for $f(x) = 1/(x_0 + x)^2$ for all $x_0$. In particular, it diverges for both the above cases. But the
series on the right-hand side of (4.13-14), although divergent when $f(x) = 1/(x_0 + x)^2$, is
asymptotically convergent; i.e., it converges for a while before it starts to diverge. When
$x_0 = 1$, the convergence never gets very far, but for $x_0 = 10$, when $m = 2$,
$B_{2m+2}/10^{2m+3} = B_6/10^7 = 2.4 \times 10^{-9}$, so that the asymptotic convergence is very good.
To get similar accuracy by direct summation of the series, we would need to sum over $10^8$
terms!
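
The second calculation of Example 4.11 can be reproduced as follows. This is a minimal sketch of ours, assuming $h = 1$ and $n \to \infty$ in (4.13-14), so that the terms at the upper limit vanish for $f(x) = 1/x^2$; the function name `euler_maclaurin_tail` is hypothetical.

```python
from fractions import Fraction

# Even-index Bernoulli numbers quoted in the text (B_2, B_4, B_6).
B = {2: Fraction(1, 6), 4: Fraction(-1, 30), 6: Fraction(1, 42)}

def euler_maclaurin_tail(x0, m):
    """Approximate sum_{j>=0} 1/(x0 + j)^2 by (4.13-14) with h = 1, n -> infinity."""
    x0 = Fraction(x0)
    integral = 1 / x0                  # integral of 1/y^2 from x0 to infinity
    half_end = 1 / (2 * x0 ** 2)       # f(x0)/2; the term at infinity vanishes
    # -B_{2k}/(2k)! * f^(2k-1)(x0) = B_{2k} / x0^(2k+1) for f(x) = 1/x^2
    corrections = sum(B[2 * k] / x0 ** (2 * k + 1) for k in range(1, m + 1))
    return integral + half_end + corrections

head = sum(Fraction(1, j * j) for j in range(1, 10))   # terms j = 1, ..., 9 summed directly
tail = euler_maclaurin_tail(10, m=2)
total = float(head + tail)
first_neglected = float(B[6] / Fraction(10) ** 7)
print(float(head), float(tail), total, first_neglected)
```

Summing the first few terms directly and applying the formula only to the tail is what moves the calculation into the region where the asymptotic series is still converging.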


This example illustrates the power of the Euler-Maclaurin formula if 
used judiciously. It is worth remarking that judiciousness is necessary in 
dealing with most asymptotic series. One exception to this rule is interpola- 
tion series (see Prob. 19, Chap. 3), which are in fact asymptotic series. In 
practical application of interpolation formulas we are virtually always in the 
region of convergence of the interpolation series. 
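Rearranging terms makes (4.13-14) a quadrature formula

$$\int_{x_0}^{x_0+nh} f(y)\,dy = h\sum_{j=0}^{n} f(x_0 + jh) - \frac{h}{2}[f(x_0 + nh) + f(x_0)] - \sum_{k=1}^{m} \frac{B_{2k} h^{2k}}{(2k)!}\left[f^{(2k-1)}(x_0 + nh) - f^{(2k-1)}(x_0)\right] - hE_m \tag{4.13-17}$$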


The first two terms on the right-hand side of (4.13-17) are just the
trapezoidal-rule approximation to the integral. Equation (4.13-17) can then
be looked at as the trapezoidal rule with correction terms. If we use
Eq. (4.1-7) to approximate $f^{(2k-1)}(x_0)$ in terms of forward differences, and if
we use the analog of (4.1-7) with backward differences of $f(x_0 + nh)$ [found
by differentiating (3.3-15)] to approximate $f^{(2k-1)}(x_0 + nh)$, the result is
Gregory's formula

$$\begin{aligned}
\int_{x_0}^{x_0+nh} f(y)\,dy ={}& h\left(\tfrac{1}{2}f_0 + f_1 + \cdots + f_{n-1} + \tfrac{1}{2}f_n\right) \\
&+ \frac{h}{12}(\Delta f_0 - \nabla f_n) - \frac{h}{24}(\Delta^2 f_0 + \nabla^2 f_n) \\
&+ \frac{19h}{720}(\Delta^3 f_0 - \nabla^3 f_n) - \frac{3h}{160}(\Delta^4 f_0 + \nabla^4 f_n) + \cdots
\end{aligned} \tag{4.13-18}$$


where we have written $f_j$ for $f(x_0 + jh)$. This formula is one of the few 
examples where it is convenient to use differences on a digital computer. For 
if we use (4.13-17), we must program the computer to calculate not only f(x) 
but also m derivatives. Moreover, we may not know how large m need be a 
priori. But we can add terms to (4.13-18) merely by computing higher differ- 
ences, and these differences will generally be much easier to compute than 
the derivatives. 


Example 4.12 Use (4.13-18) to improve the result of Example 4.10 for 16 subintervals.
Using a difference table for $1/x$, we calculate with $x_0 = 1$ and $h = \frac{1}{8}$

$$\Delta f_0 = -\tfrac{1}{9} \qquad \nabla f_{16} = -\tfrac{1}{69}$$

and the first difference correction term in (4.13-18) is

$$\tfrac{1}{96}\left(\tfrac{1}{69} - \tfrac{1}{9}\right) = -.001006$$

which, added to 1.099768, the result for 16 subintervals, gives 1.098762. Similarly, we
calculate

$$\Delta^2 f_0 = \tfrac{1}{45} \qquad \nabla^2 f_{16} = \tfrac{1}{759}$$

so that the second difference correction term is

$$-\tfrac{1}{192}\left(\tfrac{1}{45} + \tfrac{1}{759}\right) = -.000123$$

which, added to the above result, gives 1.098639. These corrections give successively better
results than the trapezoidal rule. Care must be taken not to carry the correction process
too far because Gregory's formula is also only asymptotically convergent in general.
Moreover, of course, roundoff becomes a factor with higher differences. In {63} a direct
connection between Gregory's formula and higher-order Newton-Cotes formulas is
considered. 
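
Gregory's formula (4.13-18) is easy to program because only function values are needed. The following sketch is ours and carries the four difference corrections shown in (4.13-18). The names `gregory`, `forward_diffs`, and `backward_diffs` are hypothetical, and we assume, as the figures of Example 4.12 indicate, that the integral in question is that of $1/x$ from 1 to 3.

```python
from fractions import Fraction

def forward_diffs(values, order):
    """Return [f_0, Δf_0, Δ^2 f_0, ..., Δ^order f_0] from a list of ordinates."""
    row, out = list(values), []
    for _ in range(order + 1):
        out.append(row[0])
        row = [b - a for a, b in zip(row, row[1:])]
    return out

def backward_diffs(values, order):
    """Return [f_n, ∇f_n, ∇^2 f_n, ..., ∇^order f_n] from a list of ordinates."""
    row, out = list(values), []
    for _ in range(order + 1):
        out.append(row[-1])
        row = [b - a for a, b in zip(row, row[1:])]
    return out

# Coefficients multiplying h in (4.13-18) for difference orders 1 to 4.
GREGORY = [Fraction(1, 12), Fraction(-1, 24), Fraction(19, 720), Fraction(-3, 160)]

def gregory(f, x0, h, n, order):
    """Trapezoidal rule plus the first `order` (0..4) difference corrections."""
    fs = [f(x0 + j * h) for j in range(n + 1)]
    total = h * (sum(fs) - (fs[0] + fs[-1]) / 2)      # trapezoidal rule
    fwd, bwd = forward_diffs(fs, order), backward_diffs(fs, order)
    for k in range(1, order + 1):
        # Odd orders enter as (Δ^k f_0 - ∇^k f_n), even orders as (Δ^k f_0 + ∇^k f_n).
        sign = -1 if k % 2 else 1
        total += GREGORY[k - 1] * h * (fwd[k] + sign * bwd[k])
    return total

f = lambda x: 1 / x
x0, h, n = Fraction(1), Fraction(1, 8), 16
results = [float(gregory(f, x0, h, n, order)) for order in range(3)]
print(results)
```

With exact rational ordinates the three printed values agree with the figures 1.099768, 1.098762, and 1.098639 of Example 4.12 to within rounding in the sixth decimal.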


4.13-2 Summation of Rational Functions; Factorial Functions 


In this section we confine ourselves to the case where f(x) in (4.13-1) is a 
rational function. In Example 4.11 we applied the Euler-Maclaurin sum 
formula to such a function. Our technique there—as it will be with all slowly 
convergent series whose sum cannot be expressed in closed form—was to 
replace the sum of the slowly convergent series by the sum of a rapidly 
convergent series. In this section, we first consider a class of rational func- 
tions whose sum can be expressed in closed form and then use this class to 
convert series of rational functions to more rapidly convergent ones. 

Our basic tool here will be the sequence of factorial functions denoted by
$x^{(n)}$ and defined by

$$\begin{aligned}
x^{(n)} &= x(x - 1)(x - 2) \cdots (x - n + 1) \qquad n \text{ a positive integer} \\
x^{(0)} &= 1
\end{aligned} \tag{4.13-19}$$

so that $n^{(n)} = n!$.
Consider first the forward difference of $x^{(n)}$ with unit spacing:

$$\begin{aligned}
\Delta x^{(n)} &= (x + 1)^{(n)} - x^{(n)} \\
&= x(x - 1) \cdots (x - n + 2)[(x + 1) - (x - n + 1)] = n x^{(n-1)}
\end{aligned} \tag{4.13-20}$$

Therefore, the factorial functions play the same role with respect to differences
that the powers of $x$ do with respect to differentiation. Thus we have

$$\sum_{x=M}^{N-1} x^{(n)} = \sum_{x=M}^{N-1} \frac{\Delta x^{(n+1)}}{n + 1} = \frac{N^{(n+1)} - M^{(n+1)}}{n + 1} \tag{4.13-21}$$

since

$$\sum_{x=M}^{N-1} \Delta f(x) = f(N) - f(M) \tag{4.13-22}$$

for any $f(x)$. When $n$ is negative, we define $x^{(n)}$ by

$$x^{(n)} = \frac{1}{(x - n)^{(-n)}} \qquad n \text{ a negative integer} \tag{4.13-23}$$

For this case, too, we can show that (4.13-20) holds and, when $n \ne -1$, that
(4.13-21) also holds. Note, in particular, that $0^{(n)} \ne 0$ when $n$ is a negative
integer (why?). 
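
The properties of the factorial functions can be verified in exact arithmetic. The sketch below is ours, with hypothetical names `factorial_power`, `delta`, and `factorial_sum`; it implements definitions (4.13-19) and (4.13-23) and the closed form (4.13-21), and is meant only for nonnegative arguments $x$ when $n$ is negative.

```python
from fractions import Fraction

def factorial_power(x, n):
    """x^(n) as defined by (4.13-19) for n >= 0 and by (4.13-23) for n < 0."""
    x = Fraction(x)
    if n >= 0:
        result = Fraction(1)
        for i in range(n):
            result *= x - i              # x (x-1) ... (x-n+1)
        return result
    # n negative: x^(n) = 1 / (x - n)^(-n) = 1 / ((x+1)(x+2)...(x+|n|))
    return 1 / factorial_power(x - n, -n)

def delta(x, n):
    """Forward difference of x^(n) with unit spacing."""
    return factorial_power(x + 1, n) - factorial_power(x, n)

def factorial_sum(M, N, n):
    """Closed form (4.13-21) for the sum of x^(n) over x = M, ..., N-1 (n != -1)."""
    return (factorial_power(N, n + 1) - factorial_power(M, n + 1)) / (n + 1)

# Spot checks of (4.13-20) and (4.13-21) for one positive and one negative n.
print(delta(7, 3) == 3 * factorial_power(7, 2))
print(delta(4, -2) == -2 * factorial_power(4, -3))
print(factorial_sum(0, 50, -3) == sum(factorial_power(x, -3) for x in range(50)))
```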

We shall now use (4.13-21) to convert slowly convergent infinite series of
rational functions to more rapidly convergent ones. Note that in such infinite
series the degree of the denominator must exceed that of the numerator by
2 or more (why?). The basic idea of this method, which is known as Kummer's
method, bears an analogy to the comparison test used to determine convergence
or divergence of infinite series. Suppose the series we wish to sum is

$$S = \sum_{x}^{\infty} R(x) \tag{4.13-24}$$

where $R(x)$ is a rational function. If there is a series

$$\bar{S} = \sum_{x}^{\infty} \bar{R}(x) \tag{4.13-25}$$

whose sum we know and such that $R(x)$ and $\bar{R}(x)$ have the same difference in
degree $d$ between numerator and denominator, then we can use $\bar{S}$ to convert
$S$ to a more rapidly convergent series. This technique is most easily illustrated
by an example. 


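Example 4.13 Evaluate

$$S = \sum_{x=2}^{\infty} \frac{x}{x^4 + 1}$$

Let

$$\bar{R}(x) = \frac{1}{(x - 1)x(x + 1)} = (x - 2)^{(-3)}$$

Then, from (4.13-21), $\bar{S} = \frac{1}{4}$ {68}. Now

$$S = \bar{S} + (S - \bar{S}) = \frac{1}{4} - \sum_{x=2}^{\infty} \frac{x^2 + 1}{(x^4 + 1)\,x\,(x^2 - 1)}$$

which converges as $1/x^5$ and therefore much more rapidly than (4.13-24). We could also
have used

$$\bar{R}(x) = \frac{1}{x(x + 1)(x + 2)} = (x - 1)^{(-3)}$$

In this case $\bar{S} = \frac{1}{12}$ and

$$S = \frac{1}{12} - \sum_{x=2}^{\infty} \frac{1 - 2x^2 - 3x^3}{x(x + 1)(x + 2)(x^4 + 1)}$$

which converges as $1/x^4$. 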
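
The gain from Kummer's method can be seen numerically. The sketch below is ours and takes the series of Example 4.13 to be $\sum_{x=2}^{\infty} x/(x^4+1)$ with comparison series $\sum_{x=2}^{\infty} 1/[(x-1)x(x+1)] = \frac{1}{4}$; the function names are hypothetical. It compares partial sums of the original and the transformed series against a long partial sum of the transformed series, which we use as a reference value.

```python
def direct_partial_sum(terms):
    """Partial sum of sum_{x>=2} x/(x^4 + 1) using the given number of terms."""
    return sum(x / (x ** 4 + 1) for x in range(2, 2 + terms))

def kummer_partial_sum(terms):
    """1/4 minus the partial sum of (x^2 + 1)/((x^4 + 1) x (x^2 - 1)), x >= 2."""
    correction = sum((x * x + 1) / ((x ** 4 + 1) * x * (x * x - 1))
                     for x in range(2, 2 + terms))
    return 0.25 - correction

# The transformed terms decay as 1/x^5, so a long partial sum serves as reference.
reference = kummer_partial_sum(200_000)
for terms in (10, 100, 1000):
    print(terms,
          abs(direct_partial_sum(terms) - reference),
          abs(kummer_partial_sum(terms) - reference))
```

The remainder of the original series after $N$ terms behaves like $1/(2N^2)$, while that of the transformed series behaves like $1/(4N^4)$, which is what the printed columns display.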


The general result is that if $R(x)$ and $\bar{R}(x)$ both have a denominator
degree which exceeds the numerator degree by $d$, and if

$$\lim_{x \to \infty} x^d R(x) = \lim_{x \to \infty} x^d \bar{R}(x) \tag{4.13-26}$$

then $R(x) - \bar{R}(x)$ has a difference in degree between numerator and
denominator of at least $d + 1$ {69}. In the above example we saw how a judicious
choice of $\bar{R}(x)$ made the resulting difference $d + 2$. The procedure used in the
above example can be applied again and again to make the summation even
more rapidly convergent, but the algebra usually becomes tedious quite
rapidly. 


4.13-3 The Euler Transformation 


The Euler transformation is a very useful device for accelerating the conver- 
gence of an oscillating series. We shall derive here a generalization useful in 
summing a finite oscillating sum which reduces to the Euler transformation 
in the case of an infinite series. Let 


$$S = v_0 - v_1 + v_2 - \cdots + (-1)^n v_n \tag{4.13-27}$$

where the $v_i$ are usually, but not necessarily, all positive. Write

$$S(x) = v_0 - v_1 x + v_2 x^2 - \cdots + (-1)^n v_n x^n \tag{4.13-28}$$

Then

$$\begin{aligned}
(1 + x)S(x) &= v_0 - (v_1 - v_0)x + (v_2 - v_1)x^2 - \cdots + (-1)^n (v_n - v_{n-1})x^n + (-1)^n v_n x^{n+1} \\
&= v_0 - (\Delta v_0)x + (\Delta v_1)x^2 - \cdots + (-1)^n (\Delta v_{n-1})x^n + (-1)^n v_n x^{n+1}
\end{aligned} \tag{4.13-29}$$

From (4.13-29) we obtain

$$S(x) = \frac{v_0 + (-1)^n v_n x^{n+1}}{1 + x} - y[\Delta v_0 - (\Delta v_1)x + (\Delta v_2)x^2 - \cdots + (-1)^{n-1}(\Delta v_{n-1})x^{n-1}] \tag{4.13-30}$$

where

$$y = \frac{x}{1 + x} \tag{4.13-31}$$

Applying this transformation again to the bracketed series in (4.13-30), we
obtain

$$S(x) = \frac{v_0 + (-1)^n v_n x^{n+1}}{1 + x} - \frac{\Delta v_0 + (-1)^{n-1}(\Delta v_{n-1})x^n}{1 + x}\,y + y^2[\Delta^2 v_0 - (\Delta^2 v_1)x + (\Delta^2 v_2)x^2 - \cdots + (-1)^{n-2}(\Delta^2 v_{n-2})x^{n-2}] \tag{4.13-32}$$
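If $p \le n$, then $p$ applications yield

$$\begin{aligned}
S(x) ={}& \frac{v_0 - y\,\Delta v_0 + y^2 \Delta^2 v_0 - \cdots + (-1)^{p-1} y^{p-1} \Delta^{p-1} v_0}{1 + x} \\
&+ (-1)^n\,\frac{v_n x^{n+1} + (\Delta v_{n-1})x^n y + \cdots + (\Delta^{p-1} v_{n-p+1})x^{n-p+2} y^{p-1}}{1 + x} \\
&+ (-1)^p y^p[\Delta^p v_0 - (\Delta^p v_1)x + (\Delta^p v_2)x^2 - \cdots + (-1)^{n-p}(\Delta^p v_{n-p})x^{n-p}]
\end{aligned} \tag{4.13-33}$$

Set $x = 1$, and we obtain the identity

$$\begin{aligned}
S ={}& \tfrac{1}{2}v_0 - \tfrac{1}{4}\Delta v_0 + \tfrac{1}{8}\Delta^2 v_0 - \cdots + (-1)^{p-1} 2^{-p} \Delta^{p-1} v_0 \\
&+ (-1)^n(\tfrac{1}{2}v_n + \tfrac{1}{4}\Delta v_{n-1} + \tfrac{1}{8}\Delta^2 v_{n-2} + \cdots + 2^{-p} \Delta^{p-1} v_{n-p+1}) \\
&+ 2^{-p}(-1)^p[\Delta^p v_0 - \Delta^p v_1 + \Delta^p v_2 - \cdots + (-1)^{n-p} \Delta^p v_{n-p}] \qquad p \le n
\end{aligned} \tag{4.13-34}$$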
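Assuming now that $n$ and $p$ are large and that the high-order differences
are small, we neglect the last bracket and obtain

$$S \approx \tfrac{1}{2}v_0 - \tfrac{1}{4}\Delta v_0 + \tfrac{1}{8}\Delta^2 v_0 - \cdots + (-1)^n[\tfrac{1}{2}v_n + \tfrac{1}{4}\Delta v_{n-1} + \tfrac{1}{8}\Delta^2 v_{n-2} + \cdots] \tag{4.13-35}$$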
Equation (4.13-35) is useful for summing a finite oscillating sum and has 
been applied in evaluating the integral over a finite interval of a rapidly 
oscillating function. 
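
Since (4.13-34) is an identity for every $p \le n$, it can be checked in exact arithmetic before the last bracket is neglected. The sketch below is ours, with hypothetical names `diffs` and `euler_finite`; it evaluates the three groups of terms on the right-hand side of (4.13-34) for a given finite sequence $v_0, \ldots, v_n$.

```python
from fractions import Fraction

def diffs(v, k):
    """Return the list of k-th forward differences Δ^k v_0, Δ^k v_1, ..."""
    row = list(v)
    for _ in range(k):
        row = [b - a for a, b in zip(row, row[1:])]
    return row

def euler_finite(v, p):
    """Right-hand side of (4.13-34) for S = v_0 - v_1 + ... + (-1)^n v_n, p <= n."""
    n = len(v) - 1
    # Terms built on differences at the start of the sequence.
    front = sum((-1) ** k * Fraction(1, 2 ** (k + 1)) * diffs(v, k)[0]
                for k in range(p))
    # Terms built on differences at the end: Δ^k v_{n-k} is the last k-th difference.
    back = sum(Fraction(1, 2 ** (k + 1)) * diffs(v, k)[n - k] for k in range(p))
    # Remaining alternating sum of p-th differences (the bracket that is later neglected).
    dp = diffs(v, p)
    rest = sum((-1) ** j * dp[j] for j in range(n - p + 1))
    return front + (-1) ** n * back + (-1) ** p * Fraction(1, 2 ** p) * rest

v = [Fraction(1, j + 1) for j in range(9)]          # v_j = 1/(j+1), n = 8
S = sum((-1) ** j * vj for j, vj in enumerate(v))
print(all(euler_finite(v, p) == S for p in range(1, 9)))
```

Dropping `rest` from the returned value gives the approximation (4.13-35).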
If we now let $n$ go to infinity, so that we have an infinite series

$$S = \sum_{j=0}^{\infty} (-1)^j v_j \tag{4.13-36}$$

which, for now, we assume to be convergent, then the terms in the brackets
in (4.13-35) all tend to zero for a fixed $p$ and (4.13-36) becomes

$$S \approx \tfrac{1}{2}v_0 - \tfrac{1}{4}\Delta v_0 + \tfrac{1}{8}\Delta^2 v_0 - \cdots + (-1)^{p-1} 2^{-p} \Delta^{p-1} v_0 \tag{4.13-37}$$

Letting $p$ go to infinity, we get the Euler transformation:

$$\sum_{j=0}^{\infty} (-1)^j v_j = \tfrac{1}{2}v_0 - \tfrac{1}{4}\Delta v_0 + \tfrac{1}{8}\Delta^2 v_0 - \cdots \tag{4.13-38}$$

A more flexible form of this transformation involves choosing an appropriate
index $m$ from which to start, so that we have

$$\sum_{j=0}^{\infty} (-1)^j v_j = \sum_{j=0}^{m-1} (-1)^j v_j + (-1)^m(\tfrac{1}{2}v_m - \tfrac{1}{4}\Delta v_m + \tfrac{1}{8}\Delta^2 v_m - \cdots) \tag{4.13-39}$$


Example 4.14 Let $S = \sum_{j=0}^{\infty} (-1)^j/[\log (j + 2)]$. $S$ converges very slowly. Apply (4.13-39)
with $m = 2$. We have

$$\begin{aligned}
v_2 &= .721348 & \Delta v_2 &= -.100013 & \Delta^2 v_2 &= .036788 \\
\Delta^3 v_2 &= -.01778 & \Delta^4 v_2 &= .00998 & \Delta^5 v_2 &= -.00617
\end{aligned}$$

Therefore, to four figures

$$S = (1.442695 - .910239) + \tfrac{1}{2}(.721348) + \tfrac{1}{4}(.100013) + \tfrac{1}{8}(.036788) + \tfrac{1}{16}(.01778) + \tfrac{1}{32}(.00998) + \tfrac{1}{64}(.00617) + \cdots = .9243$$


Had we started with m=0, we would have required 11 terms to achieve the same 
accuracy. 
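
The calculation of Example 4.14 can be sketched in a few lines. The code below is ours; the name `euler_transform_sum` and the parameter `p` (the number of difference terms retained in (4.13-39)) are hypothetical, and `log` is taken to be the natural logarithm, which is what the tabulated value of $v_2$ implies.

```python
from math import log

def euler_transform_sum(v, m, p):
    """Approximate sum_{j>=0} (-1)^j v(j) by (4.13-39), keeping p difference terms."""
    head = sum((-1) ** j * v(j) for j in range(m))      # terms summed directly
    # v_m, ..., v_{m+p-1} suffice to form Δ^k v_m for k = 0, ..., p-1.
    row = [v(m + i) for i in range(p)]
    tail = 0.0
    for k in range(p):
        tail += (-1) ** k * row[0] / 2 ** (k + 1)        # (-1)^k Δ^k v_m / 2^(k+1)
        row = [b - a for a, b in zip(row, row[1:])]
    return head + (-1) ** m * tail

def v(j):
    return 1.0 / log(j + 2)

approx = euler_transform_sum(v, m=2, p=6)
print(approx)
```

With `p=6` the function uses exactly the six quantities tabulated in the example; larger `p` adds the terms indicated by the dots in the displayed sum.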


The Euler transformation is a sequence-to-sequence transformation 
which takes partial sums of one series into partial sums of another. It can be 
shown that if the original series converges, so does the transformed one, and 
in the case of oscillating series, the transformed series usually converges 
more rapidly. On the other hand, there exist many divergent series which are 
transformed into convergent series. In such cases, the sum of the trans- 
formed series can be thought to represent in some way the “average value” 
of the partial sums of the divergent series. The Euler transformation has 
been applied to the evaluation of integrals over infinite intervals of rapidly 
oscillating functions. 

We have at best touched the surface of the existing mine of methods to 
sum series. References to others will be found in the Bibliographic Notes. 


BIBLIOGRAPHIC NOTES 


Sections 4.1 to 4.2 Numerical differentiation is discussed in most of the standard numeri- 
cal analysis texts. In particular, Kopal (1961) and Hildebrand (1974) have extensive discussions. 
In these two books, as well as in Steffensen (1950), the error term is derived using divided 
differences. The proof of formula (4.1-3) appears in Isaacson and Keller (1966). Coefficients for 
the Lagrangian differentiation formulas based on differences have been extensively tabulated. 
Some tables and references to others are given by Kopal (1961). An example of analytic 
differentiation by the computer appears in Rall (1969). Following the original paper by 
Richardson and Gaunt (1927), many authors have discussed Richardson extrapolation; see 
the excellent survey by Joyce (1971). The application of Richardson extrapolation to numerical 
differentiation was suggested by Rutishauser (1963), from whom Example 4.2 is taken. 


Sections 4.3 to 4.12 The most comprehensive treatment of numerical quadrature appears 
in Davis and Rabinowitz (1975), which contains an extensive bibliography and a collection of 
computer programs. Krylov (1962), which is an excellent book but slightly dated, has discus- 
sions of much of the material of this chapter. For the insight provided by other points of view 
and for some topics not covered here, see Kopal (1961), Hartree (1958), and Mineur (1952), as 
well as Krylov. Stroud (1961) gives an extensive bibliography on quadrature methods, which 
has been brought up to date by Davis and Rabinowitz. 


Sections 4.4 to 4.8 More complete accounts of the properties of the Gaussian weights and 
abscissas will be found in Winston (1934). The most complete reference available on orthogonal 
polynomials is Szegő (1975), but most of the results we have used here are available in a more 
accessible form in Jackson (1941). Much of the material in Sec. 4.8-3 is from Mineur (1952). 

An extensive collection of tables of abscissas and weights for Gaussian quadrature for- 
mulas is contained in Stroud and Secrest (1966). 
Sections 4.9 and 4.10 The error term in the Newton-Cotes quadrature formulas has been 
considered by a number of authors; see Steffensen (1950), Barrett (1952), and Sard (1948). 
Davis (1959) gives an interesting comparison of the trapezoidal rule and Gaussian formulas in 
the case where we integrate a periodic function over a full period. A good discussion of Romberg’s 
method can be found in Bauer, Rutishauser, and Stiefel (1963). Newton-Cotes formulas with 
weights other than w(x) = 1 have been considered by Kaplan (1952) and Luke (1952). An 
interesting treatment of the weight function sin kx is given by Filon (1928). A quadrature 
formula which is particularly efficient in composite rules is considered by Ralston (1959); 
see also Prob. 53. 


Section 4.11 The adaptive Simpson scheme is treated fully in Lyness (1969). Further 
schemes, examples, computer programs, and references are given in Davis and Rabinowitz 
(1975). The various components that make up an adaptive quadrature routine are given by Rice 
(1975). Different strategies in applying an adaptive-quadrature scheme are compared by Mal- 
colm and Simpson (1975). 


Section 4.12 The proof that arbitrarily high accuracy can be obtained with Gaussian 
quadrature formulas is from Chap. 3 of Todd (1962). 


Section 4.13 Steffensen (1950) discusses the Bernoulli polynomials and numbers in detail, 
but his notation is different from ours. Hamming (1973) considers a number of miscellaneous 
methods of summation; see also Cherry (1950), Rosser (1951), Shanks (1955), and Szasz (1949). 
PROBLEMS 


Section 4.1 


1 In (4.1-4) let $m_j = 0$ and $a_j = a_1 + (j - 1)h$. Determine the values of the coefficients
$w_j^{(k)}(x)$ in the formula

$$f^{(k)}(x) = \sum_{j=1}^{n} w_j^{(k)}(x) f(a_j)$$

for $x = a_i$, $i = 1, \ldots, n$, $k = 1, 2$, and $n = 3, 5$. 

2 (a) With n = 5 use (4.1-7) to generate approximations to the first four derivatives of 
f(x) at x =a. 

(b) Use the results of Prob. 13, Chap. 3 to do the same thing in terms of backward and 
central differences, retaining differences through order 5. 

3 (a) Using the notation of Prob. 9, Chap. 3, show that the Taylor-series expansion of
$f(x)$ can be written $Ef(x) = e^{hD}f(x)$, where $D$ is the operator with the property $Df(x) = f'(x)$.

(b) From this deduce the relations $hD = \ln E = \ln (1 + \Delta) = -\ln (1 - \nabla)$. 

4 Use a numerical differentiation formula with differences through order 4 to find the 
coefficients of the differential equation y” + ay’ + by = x satisfied by the function f(x) in the 
tabulation below; then determine f(x) itself. 


1.683327 1.841471 1.991207 


5 (a) Find the first derivative of f(x) = 1/(1 +x) at x = .005 by using a Lagrangian 
formula with n = 3. Use equally spaced intervals with the middle point at x = .005 and h = 1.0, 
.1, .01, respectively. Round all values of f(x) to four decimal places. 

(b) Repeat the calculations of part (a) with h = .01 using (4.1-7) and retaining terms 
through the second difference. Compare these results with those of part (a). 


Section 4.2 
*6 Consider Eq. (4.2-6) rewritten in the form 
O(h;) = $,= 9 - > a; hi? 
jel 

The process of Richardson extrapolation can be thought of as that of finding functions 

Pm(h) = bo. m — 2, Bj. mh” 

j=l 

which agree with the truncated series ®,,(h) = @ — }'"_, a,h” at the m + 1 points h,,,,k = 0, 
..., m, and then evaluating these functions at h = 0 (extrapolation) to yield the approximations 
to d, bo. m = T;, for successive values of m. In the case y,; = jy, i,(h) is a polynomial in h’, and 


we can use polynomial interpolation to find Tj. Show that if the 7’, are computed using the 
iterated interpolation scheme of Prob. 23, Chap. 3, we get formula (4.2-10). 
*7 Consider now the case when y, = jy + 0. 
(a) Using the notation of the previous problem, show that 


Pm(h) = b5, m — HP * Pry 1(h”) 


where P! _, is a polynomial of degree m — 1. 
(b) Let Ei(f) be the value at h = 0 of the polynomial of degree m through the points 
(h?,,.f(hi,,)), k =0, ..., m. Show that E',(f) is a linear functional. 
(c) If we now set gi(h,.,) = 6:4, = O(h;,,), kK =0, ..., m, show that Ei(bp ,,h7°) = 
Ei (h-°®), so that 
_ E(h-*) 
T: eT ra 
"  £,,(A7?) 
and both numerator and denominator can be computed using (4.2-10) with T replaced by E. 
(d) Show finally that 7’, can be computed using the following scheme: 


Ti = >; Ho =h,? 


Tit) _ Ti _ WHit! 
T _ Th + - mim where Di = Sra 
Din —1 his mn 1 
Ay _ Ani 


[Ref. Joyce (1971) p. 474.] 

8 If $f(x)$ is defined and has a Taylor expansion for $x \in I = [a, b]$ but is not defined outside
$I$, then at $x = a$ (4.2-1) does not hold, so that we cannot use (4.2-3) to approximate $f'(a)$ nor can
we apply Richardson extrapolation to (4.2-1).

(a) Derive formulas analogous to (4.2-1) and (4.2-3) for this case using only points in
$[a, b]$.

(b) Apply Richardson extrapolation to compute approximations to the value of the first
derivative of $\cos (\sin x/\sqrt{x})$ at $x = 0$ correct to four decimal places, using the two sequences of
$h$ given in Example 4.2. 


9 Instead of numerically differentiating the tabulation of Prob. 5, approximate the first 
and second derivatives at .6 and .7 using, respectively, the first and second differences of f (x). 
How do these compare with the derivatives calculated in Prob. 4? Show how the error in the 
approximation can be estimated using the (k + 1)st difference. 


Sections 4.3 to 4.5 


10 (a) Let $c_0, \ldots, c_{n-1}$ be a solution of the system of linear equations

$$\alpha_{i+n} + \sum_{k=0}^{n-1} c_k \alpha_{i+k} = 0 \qquad i = 0, \ldots, n - 1$$

with $\alpha_k$ given by (4.4-2). By writing (4.4-1) in the form

$$\alpha_{i+k} = \sum_{j=1}^{n} H_j a_j^{i+k} \qquad i = 0, \ldots, n - 1 \quad k = 0, \ldots, n$$

and substituting into the above linear system show that the zeros of the polynomial

$$x^n + \sum_{k=0}^{n-1} c_k x^k$$

may be used as the $a_j$'s. Then deduce that the weights can be found by solving the linear system
consisting of the first $n$ equations of (4.4-1).
(b) Apply this technique to try to find the abscissa and weight of the quadrature formula

$$\int_{-1}^{1} x f(x)\,dx = H_1 f(a_1)$$

which is to be exact for 1 and $x$. Here $\alpha_k = \int_{-1}^{1} x^{k+1}\,dx$. How do you explain the result? Use the
result to state an assumption which is implicit in the statement of part (a). 

11 By making use of the function $f(x) = p_n(x)u_{n-1}(x)$, where $u_{n-1}(x)$ is an arbitrary
polynomial of degree $n - 1$ or less, prove that it is necessary for $p_n(x)$ in (4.4-7) to be orthogonal
to all polynomials of lesser degree if the Gaussian quadrature formula is to be exact for
polynomials of degree $2n - 1$ or less. Does this argument have to be modified when there is a
weight function? 


12 (a) Use Gauss-Legendre quadrature with n = 2, 3, 4, 5 to approximate the integral 


4 od 
es: 


and compare the results with the true value (see also Sec. 4.12). 


(b) Repeat part (a) for $\int_0^1 e^{-10x} \sin x\,dx$. 


(c) Repeat part (a) for xe dx. 


(d) Repeat part (a) for $\int_{-1}^{1} (1 - x^2)^{3/2} \cos x\,dx = 3\pi J_2(1) \approx 1.08294$.


(e) Repeat part (a) for $\int_{-1}^{1} \frac{\cos x}{(1 - x^2)^{1/2}}\,dx = \pi J_0(1) \approx 2.40394$. 


Section 4.6 


13 Orthogonal polynomials. Let $\{\phi_r(x)\}$ be a sequence of polynomials such that $\phi_r(x)$ is of
degree $r$ and is orthogonal to all polynomials of lesser degree over the interval $[a, b]$ with
respect to the weight function $w(x)$. In this and the following problems, we assume that $w(x)$ is
of constant sign in $[a, b]$.

(a) By letting

$$w(x)\phi_r(x) = \frac{d^r U_r(x)}{dx^r}$$

show that $U_r(x)$ must satisfy the equation

$$\left[U_r^{(r-1)} q_{r-1} - U_r^{(r-2)} q_{r-1}' + U_r^{(r-3)} q_{r-1}'' - \cdots + (-1)^{r-1} U_r q_{r-1}^{(r-1)}\right]_a^b = 0$$

where $q_{r-1}(x)$ is any polynomial of degree $r - 1$ or less, superscripts denote differentiation, and
the notation $[\;]_a^b$ means that the value of the expression in brackets at $x = a$ is to be subtracted
from that at $x = b$.

(b) Show that the boundary conditions

$$U_r(a) = U_r'(a) = U_r''(a) = \cdots = U_r^{(r-1)}(a) = 0$$

$$U_r(b) = U_r'(b) = U_r''(b) = \cdots = U_r^{(r-1)}(b) = 0$$

satisfy the requirements of part (a) and serve to determine $\phi_r(x)$ uniquely.
(c) Finally deduce that

$$\phi_r(x) = \frac{1}{w(x)}\,\frac{d^r U_r(x)}{dx^r}$$

where $U_r(x)$ satisfies the differential equation

$$\frac{d^{r+1}}{dx^{r+1}}\left[\frac{1}{w(x)}\,\frac{d^r U_r(x)}{dx^r}\right] = 0$$

subject to the boundary conditions of part (b). Such a representation of an orthogonal
polynomial is called a Rodrigues formula. 


14 (a) By writing

$$\phi_r(x) = A_r x^r + q_{r-1}(x)$$

show that

$$\gamma_r = \int_a^b w(x)\phi_r^2(x)\,dx = A_r \int_a^b x^r w(x)\phi_r(x)\,dx$$

(b) With $U_r(x)$ defined as in Prob. 13a, use the results of Prob. 13b to show that

$$\gamma_r = (-1)^r r!\,A_r \int_a^b U_r(x)\,dx$$


15 Prove that if $w(x)$ does not change sign in $[a, b]$, the zeros of $\phi_r(x)$ are real, distinct, and
lie within $[a, b]$. 


16 Recurrence relations. (a) Show that

$$\phi_{r+1}(x) - a_r x \phi_r(x) = b_r \phi_r(x) + b_{r-1}\phi_{r-1}(x) + \cdots + b_0 \phi_0(x)$$

where $a_r = A_{r+1}/A_r$ and the $b_i$'s are constants.
(b) Use the orthogonality property of the polynomials to show that $b_i = 0$, $i = 0, \ldots,$
$r - 2$, and thus deduce that the polynomials satisfy the recurrence relation

$$\phi_{r+1}(x) = (a_r x + b_r)\phi_r(x) + b_{r-1}\phi_{r-1}(x)$$

(c) Show that

$$b_r = a_r\left(\frac{B_{r+1}}{A_{r+1}} - \frac{B_r}{A_r}\right)$$

where $B_r$ is the coefficient of $x^{r-1}$ in $\phi_r(x)$.

(d) Show that

$$b_{r-1} = -\frac{A_{r+1}A_{r-1}}{A_r^2}\,\frac{\gamma_r}{\gamma_{r-1}}$$

by multiplying the recurrence relation in part (b) first by $w(x)\phi_{r+1}(x)$ and then by $w(x)\phi_{r-1}(x)$,
integrating each equation over $[a, b]$, and solving for $b_{r-1}$ in the resulting set of equations.

*(e) Use the recurrence relation to show that if $r_1 < r_2 < \cdots < r_n$ are the zeros of $\phi_n(x)$ and
$s_1 < s_2 < \cdots < s_{n+1}$ are the zeros of $\phi_{n+1}(x)$, then $a < s_1 < r_1 < s_2 < r_2 < s_3 < \cdots < s_n < r_n <
s_{n+1} < b$. Hint: Assume that $A_r > 0$ for all $r$. 


17 The Christoffel-Darboux identity. (a) Divide the recurrence relation of part (b) of
Prob. 16 by $a_r \gamma_r$, multiply the result by $\phi_r(y)$, subtract from this the same equation with $x$ and $y$
interchanged, and use part (d) of Prob. 16 to obtain

$$(x - y)\,\frac{\phi_r(x)\phi_r(y)}{\gamma_r} = \frac{\phi_{r+1}(x)\phi_r(y) - \phi_r(x)\phi_{r+1}(y)}{a_r \gamma_r} - \frac{\phi_r(x)\phi_{r-1}(y) - \phi_{r-1}(x)\phi_r(y)}{a_{r-1}\gamma_{r-1}}$$

(b) Sum this result from $r = 0$ to $r = n$ to obtain the Christoffel-Darboux identity (4.6-3). 
18 (a) When $[a, b] = [0, 1]$ and $w(x) = x^k$ with $k$ a positive integer, derive $U_r(x)$ and from
this an expression for $\phi_r(x)$ under the condition $\phi_r(0) = 1$.

(b) Use this result to calculate $\gamma_r$ and $A_r$. 

(c) Thus derive the form of the weights and error in the corresponding Gaussian quadra- 
ture formula as given by (4.6-8) and (4.6-9). 


19 Legendre polynomials. (a) When $w(x) = 1$ and $[a, b] = [-1, 1]$, use the results of
Prob. 13 to show that

$$U_r = C_r(x^2 - 1)^r$$

where $C_r$ is an arbitrary constant.
(b) With $C_r = 1/2^r r!$, show that

$$A_r = \frac{(2r)!}{2^r (r!)^2} \qquad \text{and} \qquad \gamma_r = \frac{2}{2r + 1}$$

(c) Show that the Legendre polynomials, denoted by $P_r(x)$, satisfy the recurrence relation

$$P_{r+1}(x) = \frac{2r + 1}{r + 1}\,x P_r(x) - \frac{r}{r + 1}\,P_{r-1}(x)$$

(d) Show that $P_0(x) = 1$, $P_1(x) = x$ and use part (c) to generate the next six Legendre
polynomials. 


Section 4.7 


20 Laguerre polynomials. (a) When $w(x) = x^{\beta} e^{-\alpha x}$, $\alpha > 0$, $\beta > -1$, and $(a, b) = (0, \infty)$,
show that

$$U_r(x) = C_r e^{-\alpha x} x^{\beta + r}$$

(b) With $C_r = 1$, show that

$$A_r = (-\alpha)^r \qquad \text{and} \qquad \gamma_r = \frac{r!\,\Gamma(r + \beta + 1)}{\alpha^{\beta + 1}}$$

(c) When $\alpha = 1$, $\beta = 0$, these polynomials are called Laguerre polynomials and are
denoted by $L_r(x)$. Show that they satisfy the recurrence relation

$$L_{r+1}(x) = (1 + 2r - x)L_r(x) - r^2 L_{r-1}(x)$$

(d) Show that $L_0(x) = 1$, $L_1(x) = 1 - x$, and use part (c) to generate the next six Laguerre
polynomials.

(e) When $\alpha = 1$ and $\beta > -1$, the polynomials are called generalized Laguerre polynomials
and are denoted by $L_r^{\beta}(x)$. Show that they satisfy the recurrence relation

$$L_{r+1}^{\beta}(x) = (1 + 2r + \beta - x)L_r^{\beta}(x) - r(r + \beta)L_{r-1}^{\beta}(x)$$

(f) Repeat part (d) for the generalized Laguerre polynomials with $L_0^{\beta}(x) = 1$,
$L_1^{\beta}(x) = 1 + \beta - x$. 
21 Hermite polynomials. (a) When $w(x) = e^{-\alpha^2 x^2}$ and $(a, b) = (-\infty, \infty)$, show that

$$U_r(x) = C_r e^{-\alpha^2 x^2}$$

(b) With $C_r = (-\alpha)^{-r}$, show that

$$A_r = (2\alpha)^r \qquad \text{and} \qquad \gamma_r = \frac{2^r r!}{\alpha}\sqrt{\pi}$$

(c) When $\alpha = 1$, these polynomials are called Hermite polynomials and are denoted by
$H_r(x)$. Show that they satisfy the recurrence relation

$$H_{r+1}(x) = 2x H_r(x) - 2r H_{r-1}(x)$$

(d) Show that $H_0(x) = 1$, $H_1(x) = 2x$, and use part (c) to generate the next six Hermite
polynomials. 

22 (a) Use the recurrence relation of part (b) of Prob. 16 to show that (4.6-8) can be
written

$$H_j = \frac{A_n \gamma_{n-1}}{A_{n-1}\,\phi_n'(a_j)\,\phi_{n-1}(a_j)}$$

(b) Apply this result to get new expressions for the weights in the Gauss-Legendre,
Gauss-Laguerre, and Gauss-Hermite quadrature formulas.
(c) Use the following relations from Szegő (1975):

$$(1 - x^2)P_n'(x) = (n + 1)x P_n(x) - (n + 1)P_{n+1}(x)$$

$$x L_n'(x) = (x - n - 1)L_n(x) + L_{n+1}(x)$$

$$H_n'(x) = 2x H_n(x) - H_{n+1}(x)$$

to express the weights in terms of the polynomials of degree $n + 1$ only. 
23 (a) Use Gauss-Laguerre quadrature to approximate the integral

$$\int_0^{\infty} e^{-10x} \sin x\,dx$$

using $n = 2, 3, 4, 5$, and compare the results with the true value. 
(b) Approximate the integral of part (a) by using the result of part (b) of Prob. 12 and 
finding a bound on the integral from 1 to oo. Compare the two results. 
(c) Repeat part (a) for ff e~ */(1 + e” 7*) dx. 
(d) Use Gauss-Hermite quadrature to approximate the integral f°, |x|e~°*’ dx, with 
n = 2, 3, 4, 5, and compare the results with the true value. 
(e) Approximate the integral of part (d) using the result of part (c) of Prob. 12 and finding 
a bound on the integral outside of [—5, 5]. Compare the two results. 
(f) Repeat part (d) for $\int_{-\infty}^{\infty} e^{-x^2} \cos x\,dx$. 


Section 4.8 


24 Jacobi polynomials.
(a) When $w(x) = (1 - x)^{\alpha}(1 + x)^{\beta}$, $\alpha, \beta > -1$, and $[a, b] = [-1, 1]$, show that

$$U_r(x) = C_r(1 - x)^{\alpha + r}(1 + x)^{\beta + r}$$

(b) With $C_r = (-1)^r/(2^r r!)$, derive the relations (4.8-1) and (4.8-2). [For a recurrence
relation for the Jacobi polynomials, see Jackson (1941), p. 173.]

(c) Thus verify (4.8-4) and (4.8-5).

(d) When $\alpha = \beta = 0$, verify that Gauss-Jacobi quadrature is identical to Gauss-Legendre
quadrature. 


*25 (a) When « = B = 1, show that 


2 
J A(x; 1, 1) = nage 1(x) + Pp-1(X) 


where P,,, ,(x) is the Legendre polynomial of degree n + | and p,_ ,(x) is some polynomial of 
degree n — 1. 
(b) Use the relationship given in Prob. 22c and pertinent orthogonality relationships to
deduce that $p_{n-1}(x)$ must be identically zero. 
(c) Thus derive expressions for the weights and error term in the quadrature formula 


1 


| (1 — x?) f(x) dx = SH, f(a) +E 


j=l 


What are the abscissas? 

26 Chebyshev polynomials. 

(a) With a = B = —4 in Prob. 24 and using C, = (—2)’r!/(2r)!, use (4.8-1) and (4.8-2) to 
show that 


Tt 
A,=2""! = 
Yr 7 
(>) Use these results to derive (4.8-10) and (4.8-11). 
(c) By making the substitution x = cos @ in the integral expressing the orthogonality of 
T,(x) to polynomials of lesser degree, show that this integral implies that 


| T,(cos 8) cosk9d8@=0O k=1,...,r—1 
are) 


(2) Thus deduce that T,(cos 6) = c, cos r8 


(e) Use part (a) to deduce that c, = 1 and thus deduce (4.8-12). 
(f) Show that the Chebyshev polynomials T,(x) satisfy the recurrence relation 


T, + 1(%) = 2x T(x) — T,_ 1(x) 


27 (a) Verify (4.8-13) and (4.8-14). — (b) Thus verify (4.8-15). 

28 Let F(@) be an even, periodic function of 6 of period 27. Show how the Fourier series 
for this function can be converted into a series of Chebyshev polynomials (cf. Sec. 7.6). 

29 (a) Use Gauss-Chebyshev quadrature with n = 2, 3, 4, 5 to approximate the integral 
fi, cos x/(1 — x?)'/? dx and compare the results with the true value and the results of 
Prob. 12e. 

(b) Repeat part (a) for 1, |x| dx/(1 — x*)*/?. 

*30 (a) Derive (4.8-19) by integrating by parts the orthogonality integral for T,, ,(x) and 
requiring that A, = 2’ '. 

(b) Use (4.8-19) to calculate y, and then derive (4.8-20) to (4.8-22). 

*31 (a) Use the change of variable $y = \sqrt{x}$ in (4.6-1) to derive (4.8-23).

(b) Then use (4.6-10) and the recurrence relation in Prob. 19c to derive (4.8-25) and 
(4.8-26). 

(c) Similarly, derive (4.8-27), (4.8-29), and (4.8-30). 

*32 (a) Use the change of variable $y = \sqrt{x}$ to derive (4.8-31).

(b) Then use Sec. 4.8-2 and the recurrence relation in Prob. 26f to derive (4.8-32) to 
(4.8-34). 

33 (a) Approximate the integral $\int_0^1 \sqrt{x} \cos x\,dx$ using (4.8-18) with n = 2. Find a bound
on the error in your approximation.

(b) Approximate the integral $\int_0^1 (x/(1 - x))^{1/2} e^x\,dx$ using (4.8-18) with n = 2. Again find a
bound on the error.

34 (a) Remove the singularity in [4 e~ °*/(x? — 1)'/? dx by making a change of variable. 

(b) Consider the integral $\int_0^a x^p f(x)\,dx$ for p > -1 and not an integer where f(x) is regular
at x = 0 so that the integral has a singularity at 0. By writing

$$\int_0^t x^p f(x)\,dx = \frac{t^{p+1}}{p + 1} h(t)$$

and differentiating derive a differential equation for h(t). [This differential equation can be
integrated numerically by one of the methods of the next chapter to find h(a), which in turn
gives the value of the integral.] At t = 0 find initial conditions for h(t) and its first two deriva-
tives. [Ref.: Hartree (1958), pp. 110-111.]


35 Consider the problem of calculating

$$g(y) = \frac{d}{dy} \int_0^y \frac{f(x)}{(y - x)^{1/2}}\,dx \qquad 0 < y < 1$$

(a) Why can’t we differentiate under the integral sign?
(b) Suppose f(x) is given at x = 0, h, 2h, ..., 1. Show how numerical quadrature and
numerical differentiation can be combined to find an approximation for g(y).

(c) By making the change of variable $x = y \sin^2\theta$ and using Lagrangian interpolation to
get a polynomial approximation for f(x), derive another method for approximating g(y). [Ref.:
Hamming (1973), pp. 164-165.]


Section 4.9 


36 (a) In a Gauss-Legendre composite formula if m instead of two subintervals are used,
show that the only change in (4.9-12) is to replace the $1/2^{2n}$ factor by $1/m^{2n}$.

(b) Derive the equations corresponding to (4.9-13) and (4.9-14) when n=3 and m 
subintervals are used. 

37 (a) Use a Gauss-Legendre composite formula with n = 1 and m = 2, 4, 8 to approxi- 
mate the integral of Prob. 12a. Compare the results with those obtained in Prob. 12. 

(b) Repeat part (a) for n = 2 and m = 2, 4, 8, that is, use (4.9-13). 

(c) Repeat parts (a) and (b) for the integral of part (e) of Prob. 12. Compare the results 
with those obtained in Probs. 12 and 29. 


Section 4.10 


38 Use an argument based on symmetry to prove that when w(x) is symmetric on the
interval $[a_0, a_n]$, the closed Newton-Cotes quadrature formula when n is even is exact for
polynomials of degree n + 1.


39 In order to derive (4.10-7), consider the integral

$$I_j = \int_{a_j}^{a_{j+1}} p_{n+1}(x)\,dx \qquad n \text{ even}$$

(a) Show that $I_j = -I_{n-1-j}$.
(b) Use the mean-value theorem for integrals to show that 


ea. —h A,<o <Gja1 
- ny 

j n 

¢ — j<- 

2 


Tie 


(c) Thus, deduce that [/,-,| = |/;|,j < n/2 and from this deduce that q(x) is of constant 
sign in [da, a,]. 

(d) Finally, derive both forms of the error in (4.10-7). [Ref.: Steffensen (1950), 
pp. 155-157.]

*40 When n is odd, the error has the form

$$E = \frac{1}{(n + 1)!} \int_{a_0}^{a_n} p_{n+1}(x) f^{(n+1)}(\xi)\,dx$$

(a) Write this integral as the sum of two integrals, one from $a_0$ to $a_{n-1}$ and the other from
$a_{n-1}$ to $a_n$, and apply the mean-value theorem for integrals to express the latter integral in the
form (4.10-8).
(b) Show that 


1 1 
a yi Oem aed E) = 5 Fn) + ¢ 
where c is a constant and y lies in the interval spanned by do, ..., a,-, and x. Hint: Consider 


the Lagrangian interpolation formula with n and n+ 1 points, divide both by p,(x), and 
subtract. 
(c) Use this result to write the former of the two integrals in part (a) in the form 


1 1 
=| Palx) Sn) dx 
where p,(x) = (x — do) **: (x — a,- 1). 
(d) Finally, use the result of the previous problem to derive (4.10-8). [Ref.: Steffensen 
(1950), pp. 162-165.] 


41 (a) Let the quadrature formula 
1 n 
| SI (x) dx % Hy f(—1) + YA, f (aj) 
-1 j=l 


be exact for polynomials of degree 2m — 1 or less (m <n). Prove that if all the weights are 
positive and all the a,s are in [—1, 1], then there exists at least one a, in the interval [— 1, B,] 
where f, is the smallest of the zeros of P,,(x), the derivative of the Legendre polynomial of 
degree m. Hint: Use a quadrature formula where —1 and + 1 are required to be abscissas and 
consider the function 


_ [Pa(x)P(1 — x?) 


I (x 
x — By 
(b) Thus deduce that in the quadrature formula (4.10-1) with [a, b] = {—1, 1] 
2—n 
<B, 
n 


if (4.10-1) is to be exact for polynomials of degree 2m — 1 and all the weights are positive. 
(c) It can be shown that 


<—_* 
' (m — 1)(m + 3) 


B 


for m > 1. Use this and the fact that Newton-Cotes closed formulas are exact for polynomials of 
degree n + 1 when n is even and n when n is odd to prove that the weights of such formulas are 
positive only when n < 7 and n = 9. [Ref.: Bernstein (1937).] 

42 (a) Show that the Newton-Cotes open formulas with n even are exact for polynomials 
of degrees n — 1. 

(b) When w(x) = 1, derive the error terms for the open formulas by using the techniques of 
Probs. 39 and 40.

*43 (a) Show that the Newton-Cotes quadrature formulas without the error term may
be written as 


f(x)dx b—a foxdx fox" ax 
f (ao) 1 Ao wae a’ 
fay) l a, as =0 
f (an) 1 a, ay 


(b) Using an equation analogous to (4.4-1), deduce that the Cotes numbers can be cal- 
culated using the inverse of the matrix 


1 1 1 

Ay a, a, 

A,= | a2 a? 
ay a, 


(c) Let 
n(x) = (x — do)(x — ay) +++ (x — aj_ (x — aj41)°°* (% — @,) 


= > Cux* i=0O,...,n 


Show that the element in the mth row and kth column of S, = A,’ is 


Cn 1,k-1 
Ten — 1(4,,- 1) 
(d) Derive S,(—4, 4) and S,(—1, 0, 1) where the values in parentheses are the abscissas. 
(e) Show how the matrix S,(—1, 0, 1) can be used to generate any quadrature formula of 
the form 


b 
[ wix) f(x) dx = Hy f(—1) + Hy £0) + H2 (1) 
Apply this to the cases (i) w(x) = x and [a, b] = [0, 1] and (ii) w(x) = x? and [a, b] = [—1, 1]. 
[Ref.: Hamming (1973), chap. 15.] 
*44 (a) Use the Lagrangian interpolation formula to prove that 
k 


» (- (0 + jh) = (—1sFA74fP™(E) x —kh<G<xt+kh 


jack 
(b) Use this result with k =2 and the Newton-Cotes closed-type five-point formula 
(n = 4) to derive the formula 


x) dx = aa f(x — 2h) + Bf (x — h) + Bf(x +h) + f(x + 2h)] 


x- 


4h* , , 
+ 2m) - 2 (n2)] 


(c) Similarly, using the Newton-Cotes closed-type seven-point formula, derive Weddle’s
rule

$$\int_{x-3h}^{x+3h} f(x)\,dx = \frac{3h}{10}\,[f(x - 3h) + 5f(x - 2h) + f(x - h) + 6f(x) + f(x + h) + 5f(x + 2h) + f(x + 3h)] - \frac{h^7}{1400}\,[10 f^{(6)}(\eta_1) + 9h^2 f^{(8)}(\eta_2)]$$

(d) Using the same seven-point formula, derive Hardy's rule

$$\int_{x-3h}^{x+3h} f(x)\,dx = \frac{h}{100}\,[28f(x - 3h) + 162f(x - 2h) + 220f(x) + 162f(x + 2h) + 28f(x + 3h)] + \frac{9h^7}{1400}\,[2 f^{(6)}(\eta_1) - h^2 f^{(8)}(\eta_2)]$$


45 (a) Derive the equation analogous to (4.10-15) for the parabolic rule. 

(b) Use the parabolic rule with m = 4 and 8 to approximate the integral of Example 4.3. 

(c) Use the result of part (a) to get a third approximation. Compare the results with those 
of part (b) and Example 4.3. 

46 By considering the significance of the derivative term in quadrature formula errors, 
discuss why the efficiency of Richardson extrapolation depends on the monotonic convergence 
of successive approximations using an increasing number of subintervals. 

*47 (a) Show that T, , in Romberg integration is the parabolic rule for 2* subintervals. 

(b) Similarly, show that T,, , is the composite rule formed using the Newton-Cotes closed 
formula with n = 4 (five points) and 2* subintervals. 

(c) Show that the leading term in the error of T,, , is of the order of [(6 — a)/2*}?"*"». 

(a) Use this and the number of abscissas that T,, , uses in each subinterval to show that 
T,,,, corresponds to no Newton-Cotes composite rule for m > 3. 

*48 (a) Use induction, (4.10-23), and (4.10-24) to derive (4.10-27). 

(b) Use (4.10-28) to show that )""_ 9 |c,,;| is bounded and, in fact, less than e*/*. [The true 
value of the infinite product in (4.10-28) is approximately 1.969.] 

(c) Use (4.10-27) and the boundedness of the c,,; to show that c,,, _.—;— 0 as m-— oo for 
any j. 

*49 (a) Show that d,,, in (4.10-29) is given by 


Dim = mo + 2Cq1 + 4C m2 + °° + 2? Crap j=1,...,2"**-1 


where p < mand 2? is the greatest power of 2 which divides j. For j = 0 and 2”**, show that d,,, 
is given by one-half the above. 
(b) Use (4.10-27) to show that 


lemil > 3 em, j+1 


(c) Deduce then that d,,, > 0 for all j and m and that 


9 <dj_ <0 
where « = lim,,.. . fm(0) ~ 1.452. 
50 Let ¢(x) be a function such that 
I 
=f, 
6( O,m 


Use (4.10-23) to show that Romberg’s method is equivalent to using Neville’s method of 
iterated interpolation (see Prob. 23, Chap. 3) to approximate (0). 
51 (a) Use Romberg integration to approximate the integral of Prob. 12a. In particular, 


calculate T, 9>,m = 1,..., 6, Compare this result with the results of Probs. 12 and 37 and Table 
4.8.
(b) Repeat part (a) for the integrals of parts (b), (c), and (d) of Prob. 12.
52 (a) For n= 1, 2 derive the weights for the closed Newton-Cotes quadrature formula 


| vero) dx ~ Yas (a+?) 


so that the formula will be exact for polynomials of degree n or less. 
(b) Repeat part (a) with $\sqrt{x}$ replaced by $1/\sqrt{x}$.
(c) Use the results of parts (a) and (b) to find approximate values of the integrals of 
Prob. 33a and Example 4.6, respectively, and compare the corresponding results. 
*53 Consider the quadrature formula 


b 


J foe) dx = Hols) Fla] + YH, fla) 


a 


(a) Show that 


C d?-} 


Gm aye a by dr 1 Lae bY ~ 9] 


P,(X) = 
is orthogonal over [a, b] to all polynomials of degree n — 2 or less with respect to the weight 
function (x — a)(x — b) where C and « are arbitrary constants. 

(b) Let the abscissas be the zeros of p,(x). Derive two simultaneous equations for Hy and a 
by requiring the quadrature formula to be exact when f(x) = p,(x) and xp,(x). 
(c) Find an expression for the Hs by requiring the formula to be exact when 


P,(X) 
FO) = 1) (x — a;)p,(a;) 

(d) By using the expression x* = )"_, I(x)a‘, k =0, 1, ..., n— 1, show that the quadra- 
ture formula is exact for polynomials of degree n — 1 or less. Then use this result, the result of 
part (b), and an inductive argument to show that the quadrature formula is exact for polyno- 
mials of degree 2n or less. Hint: Write a general polynomial of degree 2n as )"_ 4 c,x/p,(x) + 
q,—1(x), where q,_ ,(x) is a polynomial of degree n — 1. 

(e) Prove that the zeros of p,(x) are real and distinct. (They can also be shown to lie within 
the interval [a, b].) For n = | and [a, b] = [0, h], determine H,, H,, a, and a,. What advantage 
does this quadrature formula have in composite rules? [Ref.: Ralston (1959).] 


Section 4.11 
54 Verify that if (4.11-10) holds for i = 0,...,n — 1, then /,,(f) as given by (4.11-4) satisfies 
(4.11-2). 


55 Apply adaptive integration to each of the integrals of Prob. 12 with $\epsilon$ = .001 for part
(a), $\epsilon$ = .0000001 for part (b), and $\epsilon$ = .00001 for the others.


Section 4.12 


56 Let $\phi(x) = \exp[-(a/x^p) - (b/(1 - x)^q)]$, where a, b > 0 and p and q are positive
integers. Define

$$g(x) = \frac{1}{k} \int_0^x \phi(t)\,dt \qquad \text{where } k = \int_0^1 \phi(t)\,dt$$

(a) Show that $\int_0^1 f(x)\,dx = (1/k) \int_0^1 f(g(x))\phi(x)\,dx = (1/k) \int_0^1 s(x)\,dx$.
(b) Show that if f(x) is infinitely differentiable in I = [0, 1], then so is s(x) and, in addition,
$s^{(k)}(x)$ vanishes at x = 0 and x = 1 for k = 0, 1, ....

(c) By using the Euler-Maclaurin formula in the form (4.13-17) show that the trapezoidal
rule should give good accuracy when applied to integrate s(x) over I even if f(x) exhibits
singular behavior at the endpoints of I.

(d) Finally, derive the quadrature rules 


[ora  Soli ll) Evra 


where the H, and a,, j = 1,..., n — 1, can be precomputed for various values of n, and indicate 
how these rules can be used to evaluate integrals of functions with endpoint singularities. 


Section 4.13 


57 (a) Use the definitions of the Bernoulli polynomials and numbers and (4.13-6) to 
derive (4.13-7) and (4.13-8). 

(b) Use these identities to derive (4.13-10). 

(c) Use (4.13-2) to generate B,(x), k = 0, 1, 2, 3. 


*58 (a) Show that 


t + t ot oth t 
é-12 22 
and from this deduce the result (4.13-6). 
(b) Similarly, show that 
t t t 
ae hae ieee ail 


and from this deduce that B,,,,(4) = 0, k > 0. 
(c) Use (4.13-2) to show that 


B,(1 — x) + B, = (—1)(B,(x) + B,] 


and thus deduce that for k > 1, B,,(x — 4) is an even function and B,,,,(x — 4) is an odd 
function. 

(d) Use parts (b) and (c), (4.13-7) and (4.13-8) to deduce that, if B,,, ,(x) vanishes anywhere 
in [0, 1] other than at 0, 4, and 1, then B,,, ,(x) has at least five zeros on (0, 1] and B,,(x) + By, 
has at least four zeros. Thus show that B,,- ,(x) must also have five zeros. By continuing this 
argument, find a contradiction and thus deduce that B,,(x) can vanish only at the endpoints of 
(0, 1] and that B,,(x) takes on its extreme value in [0, 1] at x = 4. 


59 (a) Use Prob. 58 to deduce that (4.13-15) can be written 


nh2™+* bp(ams in) oh y 
acca Ls ; I EA | 
En (2m + 2)! | 2m (7) y 


when X9 <4 < Xo + nh. 
(b) Use (4.13-8) and Prob. 58 to show that 


1 
Bam + 2(t) dt = — Ban+2 
“Oo 


(c) Use this result to deduce (4.13-16).

*60 (a) Use (4.13-7) and (4.13-8) to show that
Boy + 2(x) = (2k + 1)(2k + 2)[B,,(x) + Bay] 


(b) Use (4.13-3), Prob. 59b, and part (a) to deduce that B},, (0) has a sign different from 
that of B,,(x) on [0, 1]. 

(c) Deduce from this that B,,(x) and B,, 4 2(x) have opposite signs on [0, 1] and therefore 
that B,, and B,,,, have opposite signs. 

(d) Use this result and (4.13-16) to deduce that, if f°?"* ?(x) and f°?" * (x) do not change 
sign and have the same sign on [x9, Xo + mh], then the error in the Euler-Maclaurin sum 
formula is less than the magnitude of the first neglected term. 


*61 (a) Use (4.13-2) and (4.13-5) to show that 


k 


t 
3f =n 2 DIX (3) )+ Bl 


(b) Use (4.13-5) to show that 


t r* 


2 ez =e in 1B 


(c) Deduce then that 
B,(4) = 2(27* — 1)B, 


(d) Use this result, Eq. (4.13-15), and Prob. 58d to show that if f°" * (x) does not change 
sign on [X9, X9 + nh], then the error in the Euler-Maclaurin sum formula is less than twice the 
first neglected term. 


62 If we sum the ordinates halfway between the successive ordinates in (4.13-14), we can 
derive the second Euler-Maclaurin sum formula [see Steffenson (1950), pp. 134-135]: 


Xo tanh — 7! 2k hake 1 
¥ Slo + (j + 3)h] = an fy) dy ~ ¥ SA Ean 


x [FO D(x9 + nh) — f" ?(Xo)] + Ep 


(1— 2717 ™)Bay4 2h?” *? 
where E,, = —n-————____? pam 
(2m + 2)! f (6) 
where x, < & <x, + nh. Show that when this formula is used as a quadrature formula, it 
corresponds to a composite rule based on the midpoint rule with correction terms. From this 
formula, derive an analog of Gregory’s formula. 


63 (a) Show that if Gregory’s formula (4.13-18) is written to include differences through 
order n, it is exact for polynomials of degree n. Thus deduce that Gregory’s formula with 
differences through order n is equivalent to a Newton-Cotes closed formula with n + | points. 

(b) Write the equivalent to Gregory’s formula using central differences. This formula is 
called Gauss’ formula. (Use the notation of Prob. 13c, Chap. 3.) [Ref.: Hamming (1973), p. 344.] 


64 Euler’s constant $\gamma$ is defined as

$$\gamma = \lim_{n \to \infty} \left( \sum_{i=1}^{n} \frac{1}{i} - \ln n \right) \approx .57721566$$

(a) Use the Euler-Maclaurin sum formula with $x_0 = 1$ to try to approximate $\gamma$ to eight
decimal places. Explain your results.

(b) Repeat part (a) with $x_0 = 10$ and again explain your results.

65 (a) Use Gregory’s formula with n = 16 and retaining differences through the second to 
approximate the integral of part (a) of Prob. 12. (Use the results of Prob. 51a to get the
trapezoidal-rule approximation.) Compare the results with those of Probs. 12, 37, and 51 and
Table 4.8. 

(b) Repeat part (a) with n = 8. How do you explain the result? 

66 (a) Let f(x) be a function defined on I = [a, b] such that $f^{(2k-1)}(a) = f^{(2k-1)}(b)$, k = 1,
..., m. Use the Euler-Maclaurin formula to determine the error when the trapezoidal rule is
used to integrate f(x) over I.

(b) Show that there are m bounds on this error and give their form. 

(c) Show why the trapezoidal rule is a good rule to use when integrating a smooth 
periodic function over a full period. 

(d) Why should we not expect Romberg integration to improve on the results of the 
trapezoidal rule for the functions mentioned in parts (a) and (c)? 

(e) Evaluate the integral

$$\int_0^1 \frac{2}{2 + \sin 10\pi x}\,dx = 1.15470054$$

by the trapezoidal rule using 2, 4, 8, and 16 intervals. Apply Richardson extrapolation to these
values.

(f) Evaluate the integral of part (e) using Gauss-Legendre rules with the same number of
function evaluations as in part (e) and compare the results.

[Ref.: Davis and Rabinowitz (1975), pp. 76 and 112.] 


67 (a) Show that $x^n$ can be written as a linear combination of the $x^{(k)}$, k = 0, 1, ..., n. The
coefficients $s_k^{(n)}$ in this linear combination are called Stirling numbers of the second kind.
(b) How would you use part (a) to evaluate

$$\sum_{x=M}^{N-1} p(x)$$

where p(x) is any polynomial?


(c) Show that $s_n^{(n)} = 1$ for any n and $s_0^{(n)} = 0$ for n > 0.
(d) Derive the recurrence relation

$$s_k^{(n+1)} = s_{k-1}^{(n)} + k s_k^{(n)}$$

and with this tabulate the Stirling numbers of the second kind for n = 1, 2, 3, 4, 5.


68 (a) Verify that S in Example 4.13 equals 4. 
(b) Use the technique of Sec. 4.13-2 to convert the problem of summing 


to one of summing a more rapidly convergent series. 
(c) Repeat part (b) for 


By summing a few terms of S and the more rapidly convergent series, compare the convergence
of both to the true result $\pi^4/90$. [Ref.: Hamming (1973), p. 196.]

69 Prove that, if (4.13-26) is satisfied when R(x) and R(x) both have denominator degrees 
which exceed the numerator degrees by d, then R(x) — R(x) has a difference in degree between 
numerator and denominator of at least d + 1. 


70 (a) Let $F_{j0} = (-1)^j v_j$, so that $S = \sum_{j=0}^{\infty} (-1)^j v_j = \sum_{j=0}^{\infty} F_{j0}$. Define

$$F_{jk} = \tfrac{1}{2}(F_{j,k-1} + F_{j+1,k-1}) \qquad k = 1, 2, \ldots$$

Show that

$$\frac{1}{2} \sum_{k=0}^{n} F_{0k} = \tfrac{1}{2} v_0 - \tfrac{1}{4} \Delta v_0 + \tfrac{1}{8} \Delta^2 v_0 - \cdots + (-1)^n \Delta^n v_0 / 2^{n+1}$$

so that we have an alternative formulation of the Euler transformation in terms of the elements
of the original series and successive averages.
(b) Show that 


1 ™ 1 2 
5b Foe + Fo.me1 = Foo + 5 » Fix 
k=0 k=0 


bho 


] m 
and in general = > Fig t Fjyomer = Fi tx Yo Fjeics 


(c) Consider the following algorithm: 
S<4F 99; j—-0; k—1 


if |Fiul] < |Fyoin-1] then SS + 3F ya; k—k + 1lelse SS + Fy; j—jt! 

repeat 
Show that this chooses an appropriate index m in the modified form of the Euler transformation 
given by (4.13-39). 

(d) Apply this algorithm to Example 4.14. 

71 (a) Let $G_{j0}$ be the (j + 1)st partial sum of the series S, that is, $G_{j0} = \sum_{i=0}^{j} F_{i0}$, where
we use the notation of Prob. 70. Define $G_{jk} = \tfrac{1}{2}(G_{j,k-1} + G_{j+1,k-1})$, k = 1, 2, .... Show that

-1 


J 1 
Gi, = 2» Fio + 5 yi 
i=0 i=0 


so that Gp, is equal to the Euler transformation with n+ 1 terms and G,,, is equal to the 
modified Euler transformation given by (4.13-39). 

(b) Apply this method of averaging to the series in Example 4.14. 

(c) Apply all three versions of the Euler transformation to compute the value of the series
$S = \sum_{j=0}^{\infty} (-1)^j (2j + 1)^{-1}$ to six decimal places.

[Ref.: Dahlquist and Bjérck (1974), pp. 71-72.] 


CHAPTER 


FIVE 


THE NUMERICAL SOLUTION 
OF ORDINARY 
DIFFERENTIAL EQUATIONS 


5.1 STATEMENT OF THE PROBLEM 


Our concern in this chapter is the solution of the first-order differential 
equation 


$$\frac{dy}{dx} = f(x, y) \qquad y(x_0) = y_0$$ (5.1-1)

We assume that 


1. f(x, y) is defined and continuous in the strip $x_0 \le x \le b$, $-\infty < y < \infty$
with $x_0$ and b finite.

2. There exists a constant L such that for any x in $[x_0, b]$ and any two
numbers y and y*


$$|f(x, y) - f(x, y^*)| \le L|y - y^*|$$


These conditions are sufficient to prove that there exists on $[x_0, b]$ a unique
continuous, differentiable function y(x) satisfying (5.1-1).† The Lipschitz
condition (2) is weaker than the assumption that $\partial f/\partial y$ is continuous and
bounded (why?).


† The reader who has not recently looked at the basic existence theory for the solution of
(5.1-1) would be well advised to do so now; see, for example, Henrici (1962a, pp. 15-26).

More generally, we may consider (5.1-1) to be a system of N first-order
differential equations in which y, $y_0$, and f are vectors with N components.
Much of what we shall develop in this chapter will be equally true for single 
equations and for systems. However, some of our results, particularly those 
on stability, will be developed for single equations only, because with 
systems the algebraic problems become intractable. Considered as a system, 
the formulation (5.1-1) is quite general in the sense that any higher-order 
equation or system of higher-order equations can be reduced to (5.1-1) if 
(and only if) the system can be rewritten with the highest-order derivative in 
each dependent variable appearing as the left-hand side of one equation and 
appearing nowhere else {1}. For example, the equation 


$$\frac{d^2 y}{dx^2} = f(x, y, y')$$ (5.1-2)

can be written

$$\frac{dz}{dx} = f(x, y, z) \qquad \frac{dy}{dx} = z$$ (5.1-3)
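
As a minimal sketch of the reduction from (5.1-2) to (5.1-3), the following Python fragment wraps a second-order right-hand side f(x, y, y') as the right-hand side of a first-order system in the pair (y, z). The helper name `as_first_order_system` and the test equation y'' = -y are illustrative assumptions and do not come from the text.

```python
def as_first_order_system(f):
    """Turn y'' = f(x, y, y') into the system (y, z)' = (z, f(x, y, z))."""
    def rhs(x, u):
        y, z = u                  # z plays the role of y'
        return (z, f(x, y, z))    # (dy/dx, dz/dx) as in (5.1-3)
    return rhs

# Illustrative (assumed) example: y'' = -y
rhs = as_first_order_system(lambda x, y, z: -y)
print(rhs(0.0, (1.0, 0.0)))       # right-hand side at x = 0, y = 1, y' = 0
```

Any method written for the vector form of (5.1-1) can then be applied to `rhs` without knowing that it arose from a second-order equation.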


Our object in solving (5.1-1) will be to find y at a sequence of values of x, 
{x;}. We distinguish here between two types of methods for effecting this 
solution: 


1. Those in which, in order to compute an approximation $y_i$ to the true
solution at the point $x_i$, information about the solution is required solely
at the previous point $x_{i-1}$. Such methods are called single-step methods.
Two examples of this class of methods are Runge-Kutta methods (see 
Sec. 5.8) and Taylor-series methods (see Sec. 5.6-1). 

2. Those in which information is needed at several previous points. These
are called multistep methods. The case in which the points $x_i$ are equally
spaced has been extensively studied and will be treated here in depth.
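
The distinction between the two classes can be illustrated with a short Python sketch. The single-step formula used here is Euler's method and the multistep formula is the two-step Adams-Bashforth formula; both choices, the test problem y' = -y, y(0) = 1, and all function names are assumptions made for illustration, not formulas taken from this section.

```python
import math

def f(x, y):
    # Assumed test problem: y' = -y, y(0) = 1, true solution exp(-x)
    return -y

def euler_step(x_prev, y_prev, h):
    """Single-step: needs the solution only at the previous point."""
    return y_prev + h * f(x_prev, y_prev)

def two_step(x_prev, y_prev, x_prev2, y_prev2, h):
    """Multistep (two-step Adams-Bashforth, assumed here): needs two previous points."""
    return y_prev + h * (1.5 * f(x_prev, y_prev) - 0.5 * f(x_prev2, y_prev2))

h, n = 0.1, 10
xs = [i * h for i in range(n + 1)]

y_single = [1.0]
for i in range(1, n + 1):
    y_single.append(euler_step(xs[i - 1], y_single[i - 1], h))

# The two-step method cannot start from y(x0) alone; one Euler step supplies y_1.
y_multi = [1.0, euler_step(xs[0], 1.0, h)]
for i in range(2, n + 1):
    y_multi.append(two_step(xs[i - 1], y_multi[i - 1], xs[i - 2], y_multi[i - 2], h))

err_single = abs(y_single[-1] - math.exp(-1.0))
err_multi = abs(y_multi[-1] - math.exp(-1.0))
print(err_single, err_multi)
```

Note that the multistep loop needs an auxiliary starting value before it can run at all, whereas the single-step loop starts directly from the initial condition.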


There are also methods which do not clearly fall into either category, 
e.g., the extrapolation methods to be discussed in Sec. 5.9-2 and the so-called 
hybrid methods, which use both information from previous points and addi- 
tional newly generated information. These latter will not be discussed here. 

For reasons which will become clear later in the chapter, our emphasis 
will be on multistep methods, although Runge-Kutta methods have an im- 
portant role to play, as we discuss in Sec. 5.8. 


In deriving methods for the numerical solution of differential equations, 
the following considerations will be important: 


1. How much error is incurred at each step of the computation (truncation 
and roundoff error) and how this error affects the results in subsequent
steps. This is the first instance in which we have had to consider the
propagation of error incurred at one stage of a calculation into later 
stages. This extremely important phenomenon, which occurs in many 
areas of numerical analysis (see, in particular, Chap. 9), is generally dis- 
cussed under the heading of stability, a stable method (see Sec. 1.7) being 
one in which errors incurred at one step do not tend to be magnified in 
later steps (we shall formalize this intuitive notion in Sec. 5.4). 

2. Related to the problem of errors and error propagation is the problem of 
being able to estimate the error at a given stage of the computation as a 
function of computed results. 

3. How the solution can be started. Equation (5.1-1) contains an initial
condition at $x = x_0$. But multistep methods require values of y at more
than one point to compute another point. Thus auxiliary means of starting
the computation will often be required. Closely related to this problem
is that of changing the interval between successive $x_i$’s during the
course of the computation.

4. The speed of the method. In the solution of large systems of equations 
(N > 100), the time required for the computation—even on the fastest of 
computers—can be considerable. Since no reasonable discussion of prob- 
lems in numerical analysis can avoid such a practical consideration, speed 
of computation will affect our evaluation of the methods to be derived 
here. 


5.2 NUMERICAL INTEGRATION METHODS 


First we introduce some notation. Let Y(x) be the true solution of (5.1-1)
and y(x) the calculated solution. Further, let


$$Y_i = Y(x_i) \qquad y_i = y(x_i)$$
$$Y'_i = f(x_i, Y_i) \qquad y'_i = f(x_i, y_i)$$ (5.2-1)
$$h_i = x_{i+1} - x_i$$


Note that since Y is the true solution, $f(x_i, Y_i)$ is equal to $dY/dx|_{x = x_i}$.
However, the function y(x) “exists” only at the points $x_i$, i = 1, 2, 3, ....
Thus, when we replace $f(x_i, y_i)$ by $y'_i$, this is a notational convenience.

A quite general equation for computing y;, which includes both multi- 
step methods and Taylor-series methods, can be derived from the equation 


j 


Y(x;) y: Ay. Y(x;- j) + T (5.2-2) 


I 
im: 


where $A_{00} = 0$ and where we use T instead of E for the error to emphasize
that this is truncation error. Actually, T is usually called the local truncation
error because it is the error incurred in one step of the integration (as
opposed to the global error over many steps; see Sec. 5.4). The equation
with which we calculate is then


Vi > 2 Aas; (5.2-3) 
j= = 


with at least one $A_{j0}$ required to be nonzero so that each calculated value
depends upon at least one previous value of y. Also at least one $A_{j1} \ne 0$, so
that (5.2-3) indeed depends on the differential equation. By analogy with
(5.2-1), we can calculate higher derivatives of y than the first. For example,


$$y''_i = \frac{d}{dx} f(x, y) \Big|_{(x_i, y_i)}$$ (5.2-4)

But because this tends to be quite tedious if f(x, y) is at all complicated, and
because the analysis of methods using higher derivatives can become very
difficult, we shall restrict ourselves mainly to the case $m_j = 1$. In Sec. 5.9-1,
we shall consider briefly methods with $m_j > 1$.

In this section we shall not only assume that $m_j = 1$ but also that the
interval $h_i$ between steps will be a constant, h, at least over a number of steps,
although it can be changed whenever error considerations make this desirable;
see Sec. 5.6-3. Methods conforming to these assumptions are called
numerical integration methods. Using these assumptions in (5.2-3), we can
rewrite this equation as


$$y_{n+1} = \sum_{i=0}^{p} a_i y_{n-i} + h \sum_{i=-1}^{p} b_i y'_{n-i}$$ (5.2-5)


where we have further changed the notation to let the last value of x at which
y was calculated be $x_n$ and to let the number of past values used to compute
$y_{n+1}$ be p + 1. The interval h has been introduced for later notational
convenience.

The following points about (5.2-5) should be noted: 


1. In any particular specialization of (5.2-5) any of the $a_i$’s or $b_i$’s may be
zero, but we assume that either $a_p$ or $b_p$ is not zero.

2. If $b_{-1} = 0$, then $y_{n+1}$ is expressed as a linear combination of (computationally)
known past values $y_{n-i}$ and functions of the $y_{n-i}$ and is thus
easily computed. Formulas with $b_{-1} = 0$ are called explicit or forward-integration
formulas.

3. If $b_{-1} \ne 0$, (5.2-5) is only an implicit equation for $y_{n+1}$ since
$y'_{n+1} = f(x_{n+1}, y_{n+1})$ and will generally be solvable only by an iterative
procedure. It will probably come as no surprise to the reader that the
greater difficulty of using formulas with $b_{-1} \ne 0$ is made up for by their
more desirable properties (otherwise they would hardly be worth a
mention). Formulas with $b_{-1} \ne 0$ are called implicit or iterative formulas.
[Analogously, (5.2-3) is said to be implicit unless all $A_{0k} = 0$, in which
case it is called explicit.] Note, however, that if the differential equation
(5.1-1) is linear, (5.2-5) can be solved explicitly for $y_{n+1}$ {3}.
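
The contrast in point 3 can be sketched in Python. As an assumed instance of (5.2-5) with $b_{-1} \ne 0$ we take p = 0, $a_0 = 1$, $b_{-1} = b_0 = 1/2$, that is, $y_{n+1} = y_n + (h/2)(y'_{n+1} + y'_n)$, and solve it for $y_{n+1}$ by simple fixed-point iteration. The test equation y' = -2y, the tolerance, and the function names are illustrative assumptions; other iterative procedures could equally be used.

```python
def f(x, y):
    # Assumed linear test equation y' = -2y
    return -2.0 * y

def implicit_step(x_n, y_n, h, tol=1e-12, max_iter=100):
    """One step of y_{n+1} = y_n + (h/2)(y'_{n+1} + y'_n) by fixed-point iteration."""
    f_n = f(x_n, y_n)
    y_new = y_n + h * f_n                 # explicit formula as a starting guess
    for _ in range(max_iter):
        y_next = y_n + 0.5 * h * (f(x_n + h, y_new) + f_n)
        if abs(y_next - y_new) < tol:
            return y_next
        y_new = y_next
    raise RuntimeError("iteration did not converge; try a smaller h")

h = 0.1
y_iter = implicit_step(0.0, 1.0, h)
# Because f is linear in y, the same formula can be solved for y_{n+1} directly.
y_direct = (1.0 - h) / (1.0 + h)
print(y_iter, y_direct)
```

For this linear equation the iterated value agrees with the directly solved one, which is the situation described in the last sentence of point 3.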


When $b_{-1} = 0$, (5.2-5) is an extrapolation equation in the sense that it
estimates a value of y at a point $x_{n+1}$ outside the interval spanned by $x_{n-i}$,
i = 0, ..., p. When $b_{-1} \ne 0$, (5.2-5) still defines $y_{n+1}$ as some function of
$y_n, \ldots, y_{n-p}, y'_n, \ldots, y'_{n-p}$ and is thus also an extrapolation equation. Thus
we can say that the numerical solution of ordinary differential equations is
essentially a process of successive extrapolations.


5.2-1 The Method of Undetermined Coefficients 


We are ready now to consider specifying the coefficients in (5.2-5). As in the 
previous two chapters, we shall do this so as to make (5.2-5) exact if Y(x) is a 
polynomial of some specified degree. By “exact” here we mean that if Y(x) is
a polynomial of the specified degree, and if the values on the right-hand side
of (5.2-5) are true values of the solution, then, except for roundoff, $y_{n+1}$ will
also be a true value. In contrast to the previous two chapters, however, we 
shall often not specify the coefficients so as to achieve the highest possible 
order of accuracy. If, for example, we have five coefficients and determine 
them so that (5.2-5) is exact for polynomials of degree 3 or less only, then in 
general we shall have one free parameter which we may use to do one or 
more of the following: 


1. Make the coefficient in the error term small

2. Make the error-propagation properties of the formula as desirable as
possible

3. Give the formula certain other desirable computational properties such
as zero coefficients


Let us suppose that we wish to make (5.2-5) exact for polynomials of 
degree r. As in the previous chapter, we shall say that such a formula has an 
accuracy of order r or is of order r. Then the method of undetermined 
coefficients consists in considering the r + 1 equations derived by letting
$y = x^j$, j = 0, ..., r, in (5.2-5). Before actually doing this, we note that (1)
there is no loss of generality in setting h = 1 because putting $y = x^j$ in (5.2-5)
results in the cancellation of h (why?) and (2) there is no loss of generality in
letting $x_n = 0$ since the coefficients will be independent of the origin of the
coordinate system (why?).

With h = 1 and $x_n = 0$, we have


$$1 = \sum_{i=0}^{p} a_i \qquad j = 0$$

$$1 = -\sum_{i=0}^{p} i a_i + \sum_{i=-1}^{p} b_i \qquad j = 1$$ (5.2-6)

$$1 = \sum_{i=0}^{p} (-i)^j a_i + j \sum_{i=-1}^{p} (-i)^{j-1} b_i \qquad j = 2, \ldots, r$$

These are r + 1 equations for the 2p + 3 or fewer coefficients in (5.2-5) (some
of the coefficients, for example, $b_{-1}$, may be postulated equal to zero). If the
number of coefficients is r + 1, then, generally, (5.2-6) can be solved for the
$a_i$’s and $b_i$’s. If the number of coefficients is greater than r + 1, we shall have
free parameters in general, and if the number is less than r + 1, there will in
general be no solution.
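
The equations (5.2-6) are linear in the coefficients, so when the number of unknowns equals r + 1 they can be set up and solved mechanically. The following Python sketch does this in exact rational arithmetic for the case where all 2p + 3 coefficients are unknown; the function names `build_system` and `solve` are our own, and the small Gauss-Jordan routine assumes a square, nonsingular system.

```python
from fractions import Fraction

def build_system(p, r):
    """Rows of (5.2-6) for unknowns a_0..a_p, b_{-1}..b_p, with h = 1 and x_n = 0."""
    rows = []
    for j in range(r + 1):
        a_part = [Fraction((-i) ** j) for i in range(p + 1)]
        # derivative of x^j at x = -i is j * (-i)^(j-1); it vanishes for j = 0
        b_part = [Fraction(j * (-i) ** (j - 1)) if j > 0 else Fraction(0)
                  for i in range(-1, p + 1)]
        rows.append(a_part + b_part + [Fraction(1)])   # right-hand side: y_{n+1} = 1
    return rows

def solve(rows):
    """Gauss-Jordan elimination in exact arithmetic (square, nonsingular case only)."""
    m = [row[:] for row in rows]
    n = len(m)
    for col in range(n):
        pivot = next(k for k in range(col, n) if m[k][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for k in range(n):
            if k != col and m[k][col] != 0:
                factor = m[k][col]
                m[k] = [vk - factor * vc for vk, vc in zip(m[k], m[col])]
    return [row[-1] for row in m]

# p = 0, r = 2: three equations for the three coefficients a_0, b_{-1}, b_0
a0, b_minus1, b0 = solve(build_system(p=0, r=2))
print(a0, b_minus1, b0)
```

With p = 0 and r = 2 the system has three equations in three unknowns, which is the situation treated by hand in Example 5.1.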

If the first two equations of (5.2-6) are satisfied, we say that the associated
numerical integration method is consistent. Consistency is thus equivalent to 
(5.2-5) being exact for linear polynomials. All the numerical integration 
procedures we shall consider will be consistent. 


Example 5.1 Determine the coefficients in

$$ y_{n+1} = a_0 y_n + h(b_{-1} y'_{n+1} + b_0 y'_n) \tag{5.2-7} $$

if the formula is to be exact for polynomials of degree 2, that is, r = 2, p = 0.
The equations (5.2-6) are

$$ 1 = a_0 \qquad 1 = b_{-1} + b_0 \qquad 1 = 2b_{-1} \tag{5.2-8} $$

so that the equation is

$$ y_{n+1} = y_n + \frac{h}{2}(y'_{n+1} + y'_n) \tag{5.2-9} $$

If instead of polynomials of degree 2 we had required exactness only for polynomials of
degree 1, then only the first two equations of (5.2-8) would have to be satisfied and (5.2-7)
would then be

$$ y_{n+1} = y_n + h[(1 - b_0) y'_{n+1} + b_0 y'_n] $$

with $b_0$ a free parameter.
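
The method of undetermined coefficients lends itself to a small program. The following is a minimal sketch, not part of the original text: it builds the equations (5.2-6) in exact rational arithmetic and solves them when the number of unknown coefficients equals r + 1. The function names `order_conditions` and `solve`, and the convention of naming a coefficient by a pair such as `("b", -1)`, are illustrative assumptions.

```python
from fractions import Fraction

def order_conditions(r, unknowns):
    """Augmented matrix of the r + 1 equations (5.2-6), with h = 1 and x_n = 0.

    `unknowns` lists the coefficients to be determined, e.g. ("a", 0) for a_0
    or ("b", -1) for b_{-1}; every coefficient not listed is taken to be zero.
    """
    rows = []
    for j in range(r + 1):
        row = []
        for kind, i in unknowns:
            if kind == "a":
                row.append(Fraction((-i) ** j))            # y_{n-i} = (-i)^j
            elif j == 0:
                row.append(Fraction(0))                    # derivative of a constant
            else:
                row.append(Fraction(j * (-i) ** (j - 1)))  # y'_{n-i} = j (-i)^(j-1)
        row.append(Fraction(1))                            # left side: y_{n+1} = 1
        rows.append(row)
    return rows

def solve(augmented):
    """Gauss-Jordan elimination for a square, nonsingular system (assumed)."""
    m = [row[:] for row in augmented]
    n = len(m)
    for col in range(n):
        pivot = next(k for k in range(col, n) if m[k][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        scale = m[col][col]
        m[col] = [v / scale for v in m[col]]
        for k in range(n):
            if k != col and m[k][col] != 0:
                factor = m[k][col]
                m[k] = [vk - factor * vc for vk, vc in zip(m[k], m[col])]
    return [row[-1] for row in m]

# Example 5.1: unknowns a_0, b_{-1}, b_0 with r = 2
trapezoidal = solve(order_conditions(2, [("a", 0), ("b", -1), ("b", 0)]))
```

With these inputs the solution is $a_0 = 1$, $b_{-1} = b_0 = 1/2$, in agreement with (5.2-8). A singular system (too few or redundant unknowns) is not handled by this sketch.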


Equation (5.2-9) is precisely the trapezoidal rule, as is easily seen by
replacing f(x) by y'(x) in (4.10-11). Since it is the trapezoidal rule, we know
that the truncation error incurred at each step is $-h^3 Y'''(\eta)/12$,
$x_n < \eta < x_{n+1}$. This truncation error needs to be properly interpreted. It
would be the difference between $y_{n+1}$ and the true solution (exclusive of
roundoff) if the values on the right-hand side of (5.2-9) were true values. It is also
worth noting at this point that the truncation error in formulas of the form
(5.2-5) will not always be simply determinable as in this case; see the next
section.

Example 5.1 is a special case of the general result that any Newton-Cotes
quadrature formula becomes a numerical integration formula if f(x) is
replaced by y'(x). More generally, if we replace f(x) by y'(x) in the Lagrangian
interpolation formula at equal intervals for the points $x_n, x_{n-1}, \ldots, x_{n-p}$
and then integrate between $x_{n-j}$ and $x_{n+1}$ for any j, the result is a formula of
the form (5.2-5) {5}. We shall not consider this way of generating numerical
integration formulas any further because the method of undetermined
coefficients enables us to derive all numerical integration methods, including
many of interest which are not derivable from Newton-Cotes formulas. We
shall, however, use the idea of integrating interpolation polynomials to
derive here two special cases of (5.2-3) and their specialization to numerical
integration formulas, expressed in a different notation.

The first class of methods is the Adams-type methods, which have recently
become popular again because of their good stability properties.
These methods are based on the formula

$$ Y(x_{n+1}) = Y(x_n) + \int_{x_n}^{x_{n+1}} f(x, Y(x))\, dx \qquad f(x, Y(x)) = Y'(x) \tag{5.2-10} $$


If we have approximations $y_n, y_{n-1}, \ldots, y_{n-p}$ to Y(x) at the points $x_n, x_{n-1},
\ldots, x_{n-p}$, not necessarily equally spaced, then we also know $y'_{n-i} = f(x_{n-i},
y_{n-i})$, $i = 0, \ldots, p$, and we can approximate f(x, Y(x)) by an interpolating
polynomial $Q_p(x)$. This yields the explicit Adams formula

$$ y_{n+1} = y_n + \int_{x_n}^{x_{n+1}} Q_p(x)\, dx = y_n + \sum_{i=0}^{p} d_i y'_{n-i} \tag{5.2-11} $$


where the $d_i$ are functions of the points $x_n, \ldots, x_{n-p}$. If, on the other hand,
we approximate f(x, Y(x)) by an interpolating polynomial $Q_{p+1}(x)$ involving
also the unknown value $y'_{n+1} = f(x_{n+1}, y_{n+1})$, we get the implicit Adams
formula

$$ y_{n+1} = y_n + \sum_{i=-1}^{p} d_i^* y'_{n-i} \tag{5.2-12} $$


For equally spaced points, it is convenient to write $Q_p(x)$ and $Q_{p+1}(x)$ using
the Newton backward-interpolation formulas (3.3-15) based on $f_n$ and $f_{n+1}$,
respectively, where it is convenient to use $f_i$ instead of $y'_i$. This gives the
Adams-Bashforth explicit formula

$$ y_{n+1} = y_n + h\left(f_n + \tfrac{1}{2} \nabla f_n + \tfrac{5}{12} \nabla^2 f_n + \cdots\right) \tag{5.2-13} $$

and the Adams-Moulton implicit formula

$$ y_{n+1} = y_n + h\left(f_{n+1} - \tfrac{1}{2} \nabla f_{n+1} - \tfrac{1}{12} \nabla^2 f_{n+1} - \cdots\right) \tag{5.2-14} $$

These formulas can also be written in the Lagrangian form (5.2-5).
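
As an illustration of the difference form, the sketch below applies (5.2-13), truncated after the second difference, to a single equation. It is not from the original text; the function name, the choice of truncation, and the assumption that three starting values are supplied by some other means are all illustrative.

```python
import math

def adams_bashforth_differences(f, x0, y_start, h, steps):
    """Advance y' = f(x, y) with (5.2-13) truncated after the second difference.

    `y_start` holds three starting values y_0, y_1, y_2 at x0, x0 + h, x0 + 2h
    (assumed to be available); the list of all computed values is returned.
    """
    ys = list(y_start)
    fs = [f(x0 + k * h, y) for k, y in enumerate(ys)]
    for n in range(2, 2 + steps):
        d1 = fs[n] - fs[n - 1]                    # first backward difference of f_n
        d2 = fs[n] - 2 * fs[n - 1] + fs[n - 2]    # second backward difference of f_n
        y_next = ys[n] + h * (fs[n] + d1 / 2 + 5 * d2 / 12)
        ys.append(y_next)
        fs.append(f(x0 + (n + 1) * h, y_next))
    return ys

# y' = -y, y(0) = 1, integrated to x = 1 with exact starting values
h = 0.1
start = [math.exp(-k * h) for k in range(3)]
ys = adams_bashforth_differences(lambda x, y: -y, 0.0, start, h, steps=8)
```

Halving h should reduce the error at x = 1 by a factor of roughly 8, as expected of a formula exact for polynomials of degree 3.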




The second class of methods consists of implicit formulas based on
numerical differentiation. They are important for stiff problems (see
Sec. 5.10). The idea here is as follows. If we approximate Y(x) by a polynomial
$Q_{p+1}(x)$ taking on known values $y_n, \ldots, y_{n-p}$ at the points $x_n, \ldots, x_{n-p}$
and the unknown value $y_{n+1}$ at $x_{n+1}$, differentiate $Q_{p+1}(x)$, and set the value
of $Q'_{p+1}(x_{n+1})$ equal to $y'_{n+1} = f(x_{n+1}, y_{n+1})$, we get an implicit equation for
$y_{n+1}$ of the form

$$ y_{n+1} = \sum_{i=0}^{p} c_i y_{n-i} + d\, y'_{n+1} \tag{5.2-15} $$

where the coefficients depend only on the points $x_{n+1}, \ldots, x_{n-p}$. In particular,
if the points are equally spaced, and if we write $Q_{p+1}(x)$ using the
Newton backward formula (3.3-15) and differentiate, we get the implicit
formula

$$ \nabla y_{n+1} + \tfrac{1}{2} \nabla^2 y_{n+1} + \cdots + \frac{1}{p+1} \nabla^{p+1} y_{n+1} = h y'_{n+1} \tag{5.2-16} $$

which can also be written in form (5.2-5). In this case we can derive the
coefficients $a_i$ and $b_{-1}$ by the method of undetermined coefficients.
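
As a worked instance (added here for illustration, assuming equal spacing and p = 1), expanding the two backward differences in (5.2-16) shows how the formula takes the form (5.2-5):

```latex
\begin{aligned}
\nabla y_{n+1} + \tfrac{1}{2}\nabla^2 y_{n+1}
  &= (y_{n+1} - y_n) + \tfrac{1}{2}(y_{n+1} - 2y_n + y_{n-1}) \\
  &= \tfrac{3}{2}y_{n+1} - 2y_n + \tfrac{1}{2}y_{n-1} = h y'_{n+1} \\
\text{so that}\quad
y_{n+1} &= \tfrac{4}{3}y_n - \tfrac{1}{3}y_{n-1} + \tfrac{2}{3}h y'_{n+1}
\end{aligned}
```

Here $a_0 = 4/3$, $a_1 = -1/3$, $b_{-1} = 2/3$, and the remaining $b_i$ vanish. The first two equations of (5.2-6) are satisfied, since $a_0 + a_1 = 1$ and $-a_1 + b_{-1} = 1$, so this formula is consistent.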


5.3 TRUNCATION ERROR IN NUMERICAL 
INTEGRATION METHODS 


As we noticed in the previous section, when a numerical integration method 
is equivalent to one of the numerical quadrature methods of Chap. 4, the 
truncation-error term can be easily written down. In general, however, 
numerical integration methods of the form (5.2-2) are not equivalent to 
numerical quadrature methods. Our object here is to determine the trunca- 
tion error T in (5.2-2) for formulas of the specific type (5.2-5), although the 
method we present is also applicable to the general case (5.2-2). If the numer- 
ical integration method is of the form (5.2-5), the true solution satisfies the 
equation 


$$ Y_{n+1} = \sum_{i=0}^{p} a_i Y_{n-i} + h \sum_{i=-1}^{p} b_i Y'_{n-i} + T_n \tag{5.3-1} $$

where the notation $T_n$ denotes the truncation error at the step from $x_n$ to
$x_{n+1}$.
In order to find an expression for $T_n$, it is tempting to assume that

$$ T_n = c h^{r+1} Y^{(r+1)}(\eta) \tag{5.3-2} $$

where c is a constant and r is the order of accuracy of (5.3-1), in analogy with
the trapezoidal-rule error and the general form of the error in numerical
quadrature methods. If $T_n$ has the form (5.3-2), c can be found by letting Y(x)
be $x^{r+1}$ and substituting this into (5.3-1) (why?). In fact, most numerical
integration as well as most quadrature formulas of interest do have errors of
the form (5.3-2). However, as can be seen from Theorem 2.3, the truncation
error cannot always be written in the form (5.3-2). Indeed, let us assume that
(5.3-1) is exact when Y(x) is a polynomial of degree r or less. Then we have


$$ T_n = \int_{x_{n-p}}^{x_{n+1}} Y^{(r+1)}(s) G(s)\, ds \tag{5.3-3} $$

where, in this case, the Peano kernel or influence function G(s) satisfies

$$ G(s) = \frac{1}{r!}\left[(x_{n+1} - s)_+^r - \sum_{i=0}^{p} a_i (x_{n-i} - s)_+^r - hr \sum_{i=-1}^{p} b_i (x_{n-i} - s)_+^{r-1}\right] \tag{5.3-4} $$

where, as in (2.3-6),

$$ (x - s)_+^r = \begin{cases} (x - s)^r & x \ge s \\ 0 & x < s \end{cases} \tag{5.3-5} $$

If G(s) is of constant sign in $[x_{n-p}, x_{n+1}]$, we can apply the second law
of the mean to get

$$ T_n = Y^{(r+1)}(\eta) \int_{x_{n-p}}^{x_{n+1}} G(s)\, ds \tag{5.3-6} $$

where $x_{n-p} < \eta < x_{n+1}$. Equation (5.3-6) is indeed of the form (5.3-2)
(why?). But if G(s) does change sign in $[x_{n-p}, x_{n+1}]$, the error cannot be
expressed in the form (5.3-2) {7}, although we can still bound the error as

$$ |T_n| \le \max_{x_{n-p} \le x \le x_{n+1}} |Y^{(r+1)}(x)| \int_{x_{n-p}}^{x_{n+1}} |G(s)|\, ds \tag{5.3-7} $$

Some examples of the use of the influence function to find errors are
considered in the problems {8, 9, 10}.


Example 5.2 Consider the numerical integration method

$$ y_{n+1} = (1 - a)y_n + a y_{n-1} + \frac{h}{12}[(5 - a)y'_{n+1} + (8 + 8a)y'_n + (5a - 1)y'_{n-1}] \tag{5.3-8} $$

where the parameter a is to be specified. It can be easily verified that (5.3-8) is exact for
polynomials of degree 3 or less for all a (and for polynomials of degree 4 when a = 1).
Using r = 3 in (5.3-4), we get for the influence function

$$ 3!\,G(s) = \begin{cases}
(x_{n+1} - s)^3 - \frac{h}{4}(5 - a)(x_{n+1} - s)^2 & x_n \le s \le x_{n+1} \\
(x_{n+1} - s)^3 - (1 - a)(x_n - s)^3 - \frac{h}{4}[(5 - a)(x_{n+1} - s)^2 + (8 + 8a)(x_n - s)^2] & x_{n-1} \le s \le x_n
\end{cases} \tag{5.3-9} $$

from which it is verifiable that for $a < \frac{1}{5}$, G(s) < 0 in $[x_{n-1}, x_{n+1}]$ and for a > 5, G(s) > 0 in
this interval but that for any other value of a, G(s) changes sign in the interval. For $a < \frac{1}{5}$
and a > 5, we can use (5.3-6) to express the error as

$$ T_n = -\frac{1 - a}{24} h^4 Y^{iv}(\eta) \tag{5.3-10} $$

(The case a = 1 will be considered in Example 5.5.)
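
The sign behavior of the influence function in Example 5.2 can be checked numerically. The sketch below is an illustration only, not part of the original text: it evaluates (5.3-4) for the method (5.3-8) with $x_n = 0$ on a grid of midpoints and integrates it by the midpoint rule. The function names and the grid size are arbitrary choices.

```python
def kernel(a, s, h=1.0):
    """Influence function G(s) of (5.3-4) for method (5.3-8), with x_n = 0, r = 3."""
    def plus(x, k):
        # truncated power (x - s)_+^k as in (5.3-5)
        return (x - s) ** k if x >= s else 0.0
    x_next, x_n, x_prev = h, 0.0, -h
    value = (plus(x_next, 3)
             - (1 - a) * plus(x_n, 3) - a * plus(x_prev, 3)
             - 3 * h / 12 * ((5 - a) * plus(x_next, 2)
                             + (8 + 8 * a) * plus(x_n, 2)
                             + (5 * a - 1) * plus(x_prev, 2)))
    return value / 6.0          # the factor 1/r! with r = 3

def integrate_and_signs(a, n=2000, h=1.0):
    """Midpoint-rule integral of G over [x_{n-1}, x_{n+1}] and its extreme values."""
    ds = 2 * h / n
    values = [kernel(a, -h + (k + 0.5) * ds, h) for k in range(n)]
    return sum(values) * ds, min(values), max(values)

total, lowest, highest = integrate_and_signs(0.0)
```

For a = 0 the sampled kernel is negative throughout and its integral is close to $-h^4/24$, the coefficient that multiplies the fourth derivative in the error term. For a value such as a = 0.5 the sampled kernel takes both signs.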


The method we have used here to find the error in numerical integration
methods can also be fruitfully applied to finding the error in numerical
quadrature methods. For consider the general quadrature formula (4.5-1). If
the accuracy of this formula is r, then by Theorem 2.3,

$$ E = \int_a^b f^{(r+1)}(s) G(s)\, ds \tag{5.3-11} $$

where

$$ G(s) = \frac{1}{r!}\left[\int_a^b w(x)(x - s)_+^r\, dx - \sum_{j=1}^{n} H_j (a_j - s)_+^r\right] \tag{5.3-12} $$

We could have used this technique to calculate most of the error terms in the
quadrature formulas of Chap. 4 as well as the errors in interpolation and
numerical differentiation.


5.4 STABILITY OF NUMERICAL 
INTEGRATION METHODS 


In this section we shall be interested (1) in considering how well the solution 
of the difference equation (5.2-5) approximates that of the differential equa- 
tion and (2) how we can derive bounds for the accumulated error at any 
stage. We shall limit ourselves to the case N = 1, that is, a single first-order 
ordinary differential equation. Some of the results we shall obtain are 
directly applicable to systems of equations. For others, the extension to 
systems leads to generally intractable algebraic problems {11}. 

To get an intuitive feeling for this problem, it is instructive to consider 
the differential equation 


$$ y' = -Ky \qquad y(x_0) = y_0 \tag{5.4-1} $$

whose solution is

$$ Y = y_0 e^{-K(x - x_0)} \tag{5.4-2} $$

For this differential equation, (5.2-5) becomes

$$ y_{n+1}(1 + hKb_{-1}) = \sum_{i=0}^{p} (a_i - hKb_i) y_{n-i} \tag{5.4-3} $$

This is a linear difference equation with constant coefficients and has the
solution

$$ y_n = \sum_{i=0}^{p} c_i r_i^n \tag{5.4-4} $$

where the $c_i$'s are constants and the $r_i$'s are the roots of

$$ (1 + hKb_{-1}) r^{p+1} = \sum_{i=0}^{p} (a_i - hKb_i) r^{p-i} \tag{5.4-5} $$

We have assumed in (5.4-4) that the roots in (5.4-5) are distinct. If the roots
are not distinct, then terms of the form $c_i n^\kappa r_i^n$ will appear in (5.4-4) (where $\kappa$
is less than the multiplicity of the root). Such multiple roots will affect the
details but not the substance of the development that follows.
The first thing we wish to show is that one of the roots of (5.4-5), say $r_0$,
has the form†

$$ r_0 = 1 - Kh + O(h^2) \tag{5.4-6} $$


Since we are assuming that (5.2-5) is a consistent method, we get from the
first equation of (5.2-6) that when h = 0, (5.4-5) has a root $r_0 = 1$. For $h \ne 0$,
suppose we write this root as

$$ r_0 = 1 + \sum_{i=1}^{\infty} \beta_i h^i \tag{5.4-7} $$

Substituting this into (5.4-5) and using the second equation of (5.2-6), we can
show {12} that $\beta_1 = -K$, which establishes (5.4-6).‡ Now since

$$ 1 - Kh = e^{-Kh} + O(h^2) \tag{5.4-8} $$

we can write

$$
\begin{aligned}
r_0^k &= [1 - Kh + O(h^2)]^k \\
&= e^{-Khk} + O(h^2) \\
&= e^{-K(x_k - x_0)} + O(h^2)
\end{aligned}
\tag{5.4-9}
$$
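
As a concrete check of (5.4-6), added here for illustration, take the trapezoidal rule (5.2-9), for which p = 0, $a_0 = 1$, and $b_{-1} = b_0 = \tfrac{1}{2}$. Equation (5.4-5) then has the single root

```latex
\begin{aligned}
r_0 = \frac{1 - \tfrac{1}{2}hK}{1 + \tfrac{1}{2}hK}
    &= 1 - hK + \tfrac{1}{2}(hK)^2 - \tfrac{1}{4}(hK)^3 + O(h^4) \\
e^{-hK} &= 1 - hK + \tfrac{1}{2}(hK)^2 - \tfrac{1}{6}(hK)^3 + O(h^4) \\
r_0 - e^{-hK} &= -\tfrac{1}{12}(hK)^3 + O(h^4)
\end{aligned}
```

The root agrees with $e^{-hK}$ through terms in $h^2$, so for this method of order 2 the difference between $r_0$ and $e^{-hK}$ is $O(h^3)$.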


† If f(x) = O[g(x)] as $x \to x_0$, there exists a positive constant c such that $|f(x)| \le c|g(x)|$
for x sufficiently close to $x_0$. Except where specified otherwise, we shall always have $x_0 = 0$.

‡ In fact, the $O(h^2)$ term is actually $O(h^{r+1})$, where r is the order of the numerical integration
method (why?).

The constants $c_i$ are determined by the p + 1 initial conditions required to
solve (5.4-3). We have noted previously that, in general, numerical integration
methods require starting values that are not available from the statement
of the problem, and in Sec. 5.6 [see also Sec. 5.8] we shall consider
how to obtain these values. Here we are interested in calculating $c_0$, the
coefficient of $r_0^n$. Let $y_0, \ldots, y_p$ be the initial values of y, with $y_0$ the given
initial condition of the differential equation. Then (5.4-4) is a system of p + 1
equations for the $c_i$'s. Using Cramer's rule, we have

$$
c_0 = \frac{\begin{vmatrix} y_0 & 1 & \cdots & 1 \\ y_1 & r_1 & \cdots & r_p \\ \vdots & \vdots & & \vdots \\ y_p & r_1^p & \cdots & r_p^p \end{vmatrix}}
{\begin{vmatrix} 1 & 1 & \cdots & 1 \\ r_0 & r_1 & \cdots & r_p \\ \vdots & \vdots & & \vdots \\ r_0^p & r_1^p & \cdots & r_p^p \end{vmatrix}}
\tag{5.4-10}
$$


The initial conditions $y_1, \ldots, y_p$ will generally all have errors in them, but it
is reasonable to assume that as $h \to 0$, these values approach true values.
This is equivalent to saying that whatever process we use to get the initial
values, the error will be the order of some power of h; see Sec. 5.6. Thus the
first column of the numerator approaches $y_0$ times the first column of the
denominator as $h \to 0$, which in turn means that $c_0 \to y_0$ as $h \to 0$. This,
together with our result on $r_0$, means that the first term of (5.4-4) approximates
the true solution of the differential equation and, in fact, approaches it
as $h \to 0$.

What then of the remaining terms of (5.4-4), the so-called parasitic solution?
Note that the parasitic solution arises because the order of the difference
equation (p + 1) is greater than the order of the differential equation
(1). The same argument used to estimate $c_0$ above can be used to show that
as $h \to 0$, all the other $c_i$ approach zero {12}. Thus, for small h, we expect the
coefficients $c_1, \ldots, c_p$ to be small. If the solution of the difference equation is
to be a useful approximation to the solution of the differential equation, each
of the terms $c_i r_i^n$ in (5.4-4) must remain small with respect to $c_0 r_0^n$. This
requires that

$$ |r_i| \le |r_0| \qquad i = 1, \ldots, p \tag{5.4-11} $$

The roots $r_i$, $i = 0, \ldots, p$, are functions of h. As $h \to 0$, we know that $r_0 \to 1$.
Thus, for (5.4-11) to hold as $h \to 0$, it is necessary that all roots of (5.4-5) lie on
or within the unit circle. If $r_i$, i > 0, lies on the unit circle when h = 0, then
(5.4-11) may not hold for any positive h, but if all $r_i$, i > 0, lie within the unit
circle, then for some range of positive h (5.4-11) will hold (why?). Notice that
for Adams-type methods, the only nonzero root when h = 0 is $r_0 = 1$. Thus we
are always assured an interval in h in which all the roots $r_i$, i > 0, satisfy
(5.4-11). Our first conclusion, then, is that if the solution of (5.4-3) is to be a
good approximation to the solution of (5.4-1), then (5.4-11) must hold.
Now let us consider the general case when the differential equation has
the form (5.1-1). Because of the derivative terms, we cannot solve (5.2-5)
explicitly. But as $h \to 0$, the solution of (5.2-5) must approach that of (5.4-3).
Thus, for sufficiently small h, we would expect to get results analogous to
those above. We are now ready to formalize the notion of stability as it
applies to a general equation of the form (5.1-1).


5.4-1 Convergence and Stability 


Definition 5.1 Let the initial conditions $y_k = y_k(h)$, $k = 0, \ldots, p$, used in
solving the difference equation (5.2-5) be such that

$$ \lim_{h \to 0} y_k(h) = y_0 \qquad k = 0, \ldots, p \tag{5.4-12} $$

where $y_0$ is the given initial condition in the differential equation (5.1-1).
Then the numerical integration method (5.2-5) is said to be convergent if
for any initial-value problem (5.1-1) such that f(x, y) satisfies the conditions
of Sec. 5.1 the solution of (5.2-5) is such that

$$ \lim_{h \to 0} y_n = Y(x) \qquad nh = x - x_0 \tag{5.4-13} $$

for all $x \in [x_0, b]$.


The condition (5.4-12) is, of course, required because the initial condi- 
tions used in (5.2-5) will generally not be true solutions of (5.1-1). Our 
discussion of Eq. (5.4-1) then implies the following theorem. 


Theorem 5.1 A necessary condition for a numerical integration method
to be convergent is that no root of

$$ r^{p+1} = \sum_{i=0}^{p} a_i r^{p-i} \tag{5.4-14} $$

that is, (5.4-5) with h = 0, lies outside the unit circle and that roots of
magnitude 1 are simple.


PROOF Consider the equation y' = 0, y(0) = 0 whose exact solution is
Y(x) = 0. Let the roots of (5.4-14) be $r_0, r_1, \ldots, r_p$ and suppose that they
are real and simple. Then for this equation $y_k = \sum_{i=0}^{p} h r_i^k$ is a solution of
(5.2-5) for k = p + 1, p + 2, ... (why?). Moreover, for $k = 0, \ldots, p$, $y_k$
satisfies (5.4-12) (why?). In order for $y_n$ to satisfy (5.4-13), it is clear that
the magnitude of each $r_i$ must be less than or equal to 1. This proves the
theorem in this case. In the case of complex or multiple roots, some
more proof is required; we leave this to a problem {13}.


For the equation y' = 0, y(0) = 0, the sufficiency of the condition in
Theorem 5.1 is not hard to prove {13}. In fact, for consistent methods the
condition is sufficient in general, but the proof of this is beyond the scope of
this book [see Henrici (1962a), pp. 244-246].

From the first equation of (5.2-6) it follows that r = 1 is always a root of 
(5.4-14). Our discussion of stability later in this section will indicate that it is 
in fact desirable to have the other roots of (5.4-14) as small as possible in 
magnitude. 


Example 5.3 Determine for what values of a the method of Example 5.2 is convergent.
Equation (5.4-14) is, since p = 1,

$$ r^2 = (1 - a)r + a $$

whose solutions are

$$ r = 1 \qquad \text{and} \qquad r = -a $$

Thus, for convergence $-1 < a \le 1$. Note that a must be greater than -1 to avoid a
double root at 1.


Another necessary condition for convergence is given by the following 
theorem. 


Theorem 5.2 A necessary condition for the convergence of a numerical 
integration method is that the first two equations of (5.2-6) must be 
satisfied; i.e., consistency is necessary for convergence. 


PROOF Consider the equation y' = 0, y(0) = 1, whose exact solution is
Y(x) = 1. For this equation, the numerical integration method (5.2-5)
becomes

$$ y_{n+1} = \sum_{i=0}^{p} a_i y_{n-i} \tag{5.4-15} $$

Let the starting values $y_0, \ldots, y_p$ be exact, i.e., equal to 1. Now letting
$h \to 0$ and $n \to \infty$ (nh = x) means that all values of y in (5.4-15) must
approach 1 if the method is convergent. This proves that the first equation
of (5.2-6) must be satisfied. Now consider y' = 1, y(0) = 0, whose
solution is Y(x) = x. The difference equation now is

$$ y_{n+1} = \sum_{i=0}^{p} a_i y_{n-i} + h \sum_{i=-1}^{p} b_i \tag{5.4-16} $$

Now consider the sequence defined by

$$ y_n = nhA \qquad n = 0, 1, \ldots \tag{5.4-17} $$

where

$$ A = \frac{\sum_{i=-1}^{p} b_i}{1 + \sum_{i=0}^{p} i a_i} \tag{5.4-18} $$

This sequence satisfies the restrictions (5.4-12) on the initial conditions
and also satisfies the difference equation (5.4-16) {14}. Since the solution
of (5.4-16) must approach the true solution as $h \to 0$, $n \to \infty$ (nh = x), we
conclude from (5.4-17) that A = 1. Then (5.4-18) is the second equation
of (5.2-6), which completes the proof.


We have defined convergence in the limit as $h \to 0$. But in practice we are
interested in what happens for finite values of h. First, we should like to
know when the parasitic solutions of (5.2-5) are small in relation to the
solution which approximates the solution of the differential equation, and
when this is so, we should like to be able to estimate or bound the error in
the computed solution. In order to discuss these matters, it is convenient to
introduce the accumulated error after n steps, which is the difference between
the true solution and the computed solution. We define

$$ \epsilon_n = Y_n - y_n \tag{5.4-19} $$

Before we derive an equation for $\epsilon_n$, we correct (5.2-5) by introducing the
roundoff error $R_n$ at each step

$$ y_{n+1} = \sum_{i=0}^{p} a_i y_{n-i} + h \sum_{i=-1}^{p} b_i y'_{n-i} + R_n \tag{5.4-20} $$

This roundoff error $R_n$ consists of a combination of roundoff errors in the
$y_{n-i}$, roundoff errors in the computation of $y'_{n-i} = f(x_{n-i}, y_{n-i})$, and roundoff
errors in the machine computation of (5.2-5). In most solutions of
ordinary differential equations on digital computers, truncation error is far
larger than roundoff error. However, in certain applications, particularly in
the case of real-time† computations, roundoff error may be significant.
Moreover, roundoff error limits the values of h which can be used. For
example, suppose we take a simple-minded scheme such as Euler's method,
which is derived by dropping all terms involving differences of $f_n$ in (5.2-13):

$$ y_{n+1} = y_n + h f(x_n, y_n) \qquad y_0 = Y(x_0) \tag{5.4-21} $$

It can be shown that this method yields a solution which converges to the
true solution Y(x) as $h \to 0$. However, if we compute with a sequence of
values of h tending to zero, we shall find that because of roundoff error there
is a threshold below which we cannot reduce h and still retain a meaningful
computation.
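
The threshold in h can be made visible by deliberately computing with very few digits. The sketch below is an illustration only, not from the original text: it uses Python's `decimal` module to imitate a machine carrying 4 significant decimal digits, and the test equation y' = -y, y(0) = 1 on [0, 1] is an arbitrary choice.

```python
from decimal import Decimal, localcontext
import math

def euler_decimal(h_text, digits=4):
    """Euler's method (5.4-21) for y' = -y, y(0) = 1 on [0, 1].

    All arithmetic is rounded to `digits` significant decimal digits to
    imitate a machine with a very short word length.
    """
    with localcontext() as ctx:
        ctx.prec = digits
        h = Decimal(h_text)
        steps = int(Decimal(1) / h)
        y = Decimal(1)
        for _ in range(steps):
            y = y + h * (-y)      # one Euler step, rounded after each operation
        return y

def error_at_one(h_text):
    """Absolute error of the computed value at x = 1."""
    return abs(float(euler_decimal(h_text)) - math.exp(-1.0))

errors = {h: error_at_one(h) for h in ("0.1", "0.01", "0.00001")}
```

With h = 0.1 and h = 0.01 the error falls as h is reduced. With h = 0.00001 the increment $h f(x_n, y_n)$ is smaller than half a unit in the last retained digit of $y_n$, so every step rounds back to 1 and the computed solution never leaves its initial value.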
† In real-time applications the computer is intimately tied to an operative physical system,
e.g., missile tracking or on-line control of a chemical plant. In such cases because of inherent
physical inaccuracies the number of digits used in the computation may be small, and thus
the roundoff may be large.

Subtracting (5.4-20) from (5.3-1), we get

$$ \epsilon_{n+1} = \sum_{i=0}^{p} a_i \epsilon_{n-i} + h \sum_{i=-1}^{p} b_i \epsilon'_{n-i} + E_n \tag{5.4-22} $$

where $E_n = T_n - R_n$ is the error introduced at the step from $x_n$ to $x_{n+1}$ (in
contrast to $\epsilon_n$, which is the total error that has accumulated after the n steps).
Using the mean-value theorem, we write

$$
\begin{aligned}
\epsilon'_{n-i} = Y'_{n-i} - y'_{n-i} &= f(x_{n-i}, Y_{n-i}) - f(x_{n-i}, y_{n-i}) \\
&= (Y_{n-i} - y_{n-i}) f_y(x_{n-i}, \eta_{n-i}) = \epsilon_{n-i} f_y(x_{n-i}, \eta_{n-i})
\end{aligned}
\tag{5.4-23}
$$

where the subscript y denotes partial differentiation and $\eta_{n-i}$ lies between $y_{n-i}$
and $Y_{n-i}$. Substituting (5.4-23) into (5.4-22), we get

$$ \epsilon_{n+1}[1 - hb_{-1} f_y(x_{n+1}, \eta_{n+1})] = \sum_{i=0}^{p} [a_i + hb_i f_y(x_{n-i}, \eta_{n-i})] \epsilon_{n-i} + E_n \tag{5.4-24} $$
When the differential equation is (5.4-1), and we use the simplifying assumption
that the per step error $E_n$ is a constant E, this difference equation
becomes

$$ \epsilon_{n+1}(1 + hb_{-1}K) = \sum_{i=0}^{p} (a_i - hb_iK) \epsilon_{n-i} + E \tag{5.4-25} $$

which has the same form as (5.4-3) except for the inhomogeneous term E. If
no roots are multiple, the solution of (5.4-25) is

$$ \epsilon_n = \sum_{i=0}^{p} d_i r_i^n + \frac{E}{hK \sum_{i=-1}^{p} b_i} \tag{5.4-26} $$

where we have used the first equation of (5.2-6) in getting the particular
solution. The $r_i$'s are the same as those in (5.4-4), but the $d_i$'s depend on the
initial conditions on the error for $n = 0, 1, \ldots, p$. For example, $d_0$ is given by
an equation analogous to (5.4-10) with the first column in the numerator
replaced by the initial errors. We assume that the initial errors are such that
$|d_0| \ll |c_0|$, for otherwise the computed solution will be of no use, quite
aside from the errors caused by the parasitic solution.

We note that if $|r_0| > 1$ for small h, which corresponds to $-K > 0$
[cf. (5.4-6)], the solution is an increasing exponential. Since we cannot expect
to keep the error bounded when the solution is unbounded, it is not surprising
that the $d_0 r_0^n$ term in (5.4-26) is also unbounded in this case. What we
can hope to do is to keep the error small relative to the true solution, in which
case we shall say the method of solution is stable. Since $|d_0| \ll |c_0|$, this
will be true if the terms in (5.4-26) for $i = 1, \ldots, p$, which correspond to the
parasitic solution, remain small relative to the $r_0^n$ term. This is equivalent to
saying that an error introduced in the initial conditions or at a later stage of
the computation will not propagate with a magnitude which increases relative
to the magnitude of the true solution. This brings us back to the condition
(5.4-11). But before we give a formal definition of stability, let us
consider again the general case of Eq. (5.4-24).
This equation is not tractable as it stands, so that some simplifying
assumptions are necessary. The most natural way of modifying (5.4-24) is to
replace each $f_y$ by some constant value $-K$ and $E_n$ by E and thus get an
equation of the form (5.4-25). Let us consider two ways of choosing the
constants K and E:


1. Suppose K and E are such that

$$ |f_y(x, y)| \le -K \qquad |E_n| \le E \tag{5.4-27} $$

for all n and for all points which occur in the solution of (5.4-24). Consider
a new equation formed from (5.4-24) by replacing $f_y$ by $-K$, $E_n$ by E,
$b_i$ by $|b_i|$, and $a_i$ by $|a_i|$. Suppose also that the initial conditions used in
the solution of this new equation are all greater than or equal to the
magnitudes of the corresponding initial conditions in (5.4-24). We leave
to a problem {15} the proof of the result that if $|hb_{-1}K| < 1$, the solution
of the new equation is greater than or equal to the magnitude of the
solution of (5.4-24) for all n. This approach leads to a bound on $\epsilon_n$ which
tends to be very conservative.

In order to elucidate how the error propagates from one step to the
next, which is the essence of stability and which is a local behavior, we use
a different approach.

2. Consider (5.4-24) for any value of n. For this n let $-K$ be a characteristic
or average value of $f_y$ for points in the neighborhood of those in (5.4-24).
Similarly, let E be a characteristic value of $E_n$. Then, since in practice
both $f_y$ and $E_n$ change slowly with n, we expect that locally the solution of
(5.4-25) will behave like that of (5.4-24).


In what follows we shall use the latter approach. The stability of a
numerical integration method will be defined in terms of the solution of the
characteristic equation (5.4-5) of the difference equation (5.4-25). In (5.4-5),
therefore, $-K$ is to be taken as a characteristic value of $f_y(x, y)$. Thus,
whether or not a method is stable will depend upon the particular equation
(5.1-1) to which it is being applied.


Definition 5.2 Let (5.2-5) be a consistent numerical integration method.
Let $r_i$, $i = 0, \ldots, p$, be the roots of (5.4-5), with $r_0$ the root corresponding
to the term in (5.4-4) which approximates the solution of the differential
equation. Then this method is said to be (relatively) stable on an interval
$[\alpha, \beta]$, which must include zero, if, for all hK in this interval,

$$ \left|\frac{r_i}{r_0}\right| \le 1 \qquad i = 1, \ldots, p \tag{5.4-28} $$

and if, when $|r_i| = |r_0|$, $r_i$ is a simple root. It is said to be absolutely
stable on an interval $[\gamma, \delta]$ if for all hK in this interval

$$ |r_i| < 1 \qquad i = 0, \ldots, p \tag{5.4-29} $$

Remarks


1. With $[\alpha, \beta]$ required to include zero, we are assured that for any K we can
make the solution stable by choosing h sufficiently small. Since (5.4-28)
must hold for h = 0, a necessary condition for stability is that when h = 0,
no $r_i$, $i = 1, \ldots, p$, lie outside the unit circle and that those roots on the
unit circle must be simple. Thus, from Theorem 5.1, we conclude that
convergence is necessary for stability.

2. We allow roots of magnitude equal to $|r_0|$ because if the errors in the initial
conditions are small, we shall have $|d_i| \ll |d_0|$, $i = 1, \ldots, p$. Thus if
$|r_i| = |r_0|$, although the parasitic solution will not decrease in magnitude
relative to the $r_0^n$ term, it will remain small relative to the $r_0^n$ term.
The requirement of no multiple roots of magnitude $|r_0|$ is necessary
because of the factor $n^\kappa r_i^n$ introduced into (5.4-26) by a multiple root.
Definition 5.2 in fact requires that $r_0$ be real. This is reasonable because
we cannot expect the term in (5.4-4) in $r_0$ to be a good approximation to
the solution of a real differential equation when $r_0$ is complex.

3. From (5.4-6) we see that it makes sense to consider absolute stability only
for the case hK > 0. We then see that the condition of absolute stability is
equivalent to requiring that when the solution of (5.1-1) is decreasing in
magnitude, all the parasitic solutions also decrease in magnitude (why?).
The determination of those values of hK for which a method is absolutely
stable is easier than the determination of the interval of relative stability
{16, 17}.

4. In order that a numerical integration method may be stable for as large a
class of differential equations as possible, it is desirable that $[\alpha, \beta]$ be as
large as possible. It follows then that it is desirable to have all the roots of
the convergence equation (5.4-14), except the one at r = 1, as small as
possible. Since the roots of (5.4-3) are continuous functions of the
coefficients, the range of Kh for which (5.4-28) is satisfied will tend to be
larger the smaller the roots $r_i$, $i = 1, \ldots, p$. In particular, therefore, we
would like to minimize the maximum magnitude of the $r_i$, $i = 1, \ldots, p$. The
Adams-type methods are thus optimal from this point of view since all
$r_i = 0$, $i = 1, \ldots, p$.

Example 5.4 Determine for what values of hK the consistent method of Example 5.1, that
is, (5.2-9), is stable.
Since p = 0, Eq. (5.4-5) becomes

$$ (1 + \tfrac{1}{2}hK)r = 1 - \tfrac{1}{2}hK $$

Thus there is only one root

$$ r = \frac{1 - \tfrac{1}{2}hK}{1 + \tfrac{1}{2}hK} $$

When h = 0, r = 1, so that by Theorem 5.1 the method is convergent. Since there is just
one root, Definition 5.2 is trivially satisfied and so this method is stable on the interval
$(-\infty, \infty)$.

Example 5.5 For a = 1, determine the values of hK for which the consistent method of
Example 5.2 is stable.
The equation for the roots is

$$ (1 + \tfrac{1}{3}hK)r^2 = -(\tfrac{4}{3}hK)r + (1 - \tfrac{1}{3}hK) $$

and its solution is

$$ r = \frac{1}{1 + \tfrac{1}{3}hK}\left\{-\tfrac{2}{3}hK \pm [1 + \tfrac{1}{3}(hK)^2]^{1/2}\right\} $$

The plus sign corresponds to $r_0$ since this root approaches 1 as $h \to 0$. As $h \to 0$, $r_1 \to -1$,
so that the method is convergent. The magnitude of the ratio of the roots is

$$ \left|\frac{r_1}{r_0}\right| = \left|\frac{-\tfrac{2}{3}hK - [1 + \tfrac{1}{3}(hK)^2]^{1/2}}{-\tfrac{2}{3}hK + [1 + \tfrac{1}{3}(hK)^2]^{1/2}}\right| $$

For hK < 0 this magnitude is always less than 1, but for hK > 0 it is always greater than 1.
When hK = 0, $r_0 = 1$ and $r_1 = -1$. Therefore, this method is stable only on intervals of
the form $[\alpha, 0]$, where $\alpha$ is any negative number. Whenever $f_y$ is negative, that is, hK > 0, $r_1$
has magnitude greater than $r_0$ and we would expect this method to exhibit bad error
behavior; that this is so we shall indicate in Sec. 5.7-2. Since generally any differential
equation or system of differential equations is such that $f_y$ takes on both positive and
negative values, this method should be avoided.
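
The behavior of the two roots in Example 5.5 is easy to tabulate. The following sketch is illustrative and not part of the original text: it evaluates the roots of the quadratic obtained from (5.4-5) for the method (5.3-8) with a = 1 (so that $a_0 = 0$, $a_1 = 1$, $b_{-1} = b_1 = 1/3$, $b_0 = 4/3$) and assumes hK is not equal to -3, where the leading coefficient vanishes. The function name is an arbitrary choice.

```python
import math

def simpson_roots(hk):
    """Roots r_0, r_1 of (1 + hK/3) r^2 = -(4 hK/3) r + (1 - hK/3)."""
    root = math.sqrt(1.0 + hk * hk / 3.0)
    denom = 1.0 + hk / 3.0                   # assumed nonzero, i.e. hK != -3
    r0 = (-2.0 * hk / 3.0 + root) / denom    # principal root, tends to 1 as h -> 0
    r1 = (-2.0 * hk / 3.0 - root) / denom    # parasitic root, tends to -1 as h -> 0
    return r0, r1

# Ratio |r_1 / r_0| for a few positive and negative values of hK
ratios = {hk: abs(simpson_roots(hk)[1] / simpson_roots(hk)[0])
          for hk in (-1.0, -0.1, 0.1, 1.0)}
```

For the positive values of hK the ratio exceeds 1, and for the negative values it is less than 1, while $r_0$ itself stays close to $e^{-hK}$ for small hK.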


Because convergence is necessary for stability, it is an obvious first step
in testing the stability of a numerical integration method to see whether the
roots of (5.4-5) lie on or within the unit circle when h = 0. Two ways of doing
this which do not require calculating the roots of (5.4-5) are considered in
{16, 17}. If this necessary condition is satisfied, we may proceed to determine
the range of values of hK for which the method is stable; see Sec. 5.5-4.


5.4-2 Propagated-Error Bounds and Estimates 


If a stable method is used to compute the solution of (5.4-1), the main
contribution to the accumulated or propagated error in the summation in
(5.4-26) is given by the $r_0^n$ term. To estimate $d_0$ we use Cramer's rule and get,
analogous to (5.4-10),

$$
d_0 = \frac{\begin{vmatrix} e + \gamma & 1 & \cdots & 1 \\ e + \gamma & r_1 & \cdots & r_p \\ \vdots & \vdots & & \vdots \\ e + \gamma & r_1^p & \cdots & r_p^p \end{vmatrix}}
{\begin{vmatrix} 1 & 1 & \cdots & 1 \\ r_0 & r_1 & \cdots & r_p \\ \vdots & \vdots & & \vdots \\ r_0^p & r_1^p & \cdots & r_p^p \end{vmatrix}}
\tag{5.4-30}
$$

where $\gamma = -E / (hK \sum_{i=-1}^{p} b_i)$ and we have assumed for simplicity that the
errors in the initial conditions $y_0, \ldots, y_p$ are all equal to e.† The root $r_0$, we
have seen, is close to 1; so, making the further simplifying assumption that it
is 1, we get

$$ d_0 \approx e + \gamma \tag{5.4-31} $$

An estimate of the propagated error is then given by the first term in the
summation (5.4-26) plus the particular solution

$$ \epsilon_n \approx d_0 r_0^n + \frac{E}{hK \sum_{i=-1}^{p} b_i} \approx \left(e - \frac{E}{hK \sum_{i=-1}^{p} b_i}\right) e^{-Knh} + \frac{E}{hK \sum_{i=-1}^{p} b_i} \tag{5.4-32} $$

where we have replaced $r_0$ by its approximate value $e^{-Kh}$. [Note the different
approximations for $r_0$ which led to (5.4-31) and (5.4-32); why are they both
reasonable?] To get an estimate of the propagated error in the general case
of Eq. (5.1-1), we replace $f_y$ by $-K$, as in the previous section. If E and K are
such that the conditions (5.4-27) are satisfied, then (5.4-32) will usually be a
bound on the error and a quite conservative one [cf. {15}].

To use (5.4-32) we need estimates of e, E, and K, and the latter, at least,
is often very difficult to get. As we shall see in the next section, there is an
effective way of estimating the error of each step, and for a stable method, this 
may be used to control the overall error in the computation. However, 
(5.4-32) does indicate what quantities affect the error and how they affect it. 

The definition and discussion of convergence in this section did not 
depend on the fact that only one equation was being considered. Definition 
5.1 and Theorems 5.1 and 5.2 are all valid if the relevant quantities are 
vectors. The discussion of stability, however, does require some 
modifications in the case of systems of equations. By careful application of 
the mean-value theorem in (5.4-23), we arrive at a system of equations 
(5.4-25). For nonzero values of h, this system is cross-coupled; i.e., errors 
corresponding to different dependent variables appear in the same equation. 
The resultant system of polynomial equations corresponding to (5.4-5) is 
generally somewhat intractable {11}. 


† The error in $y_0$ is caused by the necessity of rounding the given initial condition when
inserting it in the computer. The other $y_i$'s also are in error because of the truncation error in the
method used to calculate them; see Sec. 5.6.


5.5 PREDICTOR-CORRECTOR METHODS

Consider the two numerical integration methods

$$ y_{n+1} = y_n + \frac{h}{2}(y'_{n+1} + y'_n) \tag{5.5-1} $$

$$ y_{n+1} = y_{n-2} + \frac{3h}{2}(y'_n + y'_{n-1}) \tag{5.5-2} $$


Both these equations are of order 2 [see Example 5.1 for (5.5-1)], and their 
truncation errors are immediately determinable since they are equivalent to 
a Newton-Cotes closed and open formula, respectively. These truncation 
errors are, respectively, $-(h^3/12)\,Y'''(\eta_1)$ and $(3h^3/4)\,Y'''(\eta_2)$. Equation (5.5-1),
which is an iterative formula, is substantially more accurate (by a factor of 9 
in general) than (5.5-2), which is a forward-integration formula. This is an 
illustration of the general rule that for formulas of corresponding order, 
iterative formulas are substantially more accurate than forward formulas. 
Thus, despite their added difficulty in use, it is worthwhile to use them. In this 
section, we consider methods by which this can most efficiently be done. 
Again for convenience we let the number of equations be 1, but the extension 
to systems is straightforward. 


5.5-1 Convergence of the Iterations 


In order to solve (5.2-5) for $y_{n+1}$ when $b_{-1} \neq 0$, it will be necessary in
general to use an iterative procedure. That is, we guess or somehow estimate
an initial value of $y_{n+1}$, call it $y_{n+1}^{(0)}$, calculate $f(x_{n+1}, y_{n+1}^{(0)})$, insert this on the
right-hand side of (5.2-5) to get $y_{n+1}^{(1)}$, and continue this process until convergence
to some desired degree of accuracy is obtained. But first we must be
sure that the process will converge. To derive the condition for convergence,
we rewrite (5.2-5) as

$$y_{n+1}^{(j+1)} = \sum_{i=0}^{p} (a_i y_{n-i} + h b_i y'_{n-i}) + h b_{-1}\,(y_{n+1}^{(j)})' \tag{5.5-3}$$

where $y_{n+1}^{(j)}$ is the jth approximation to $y_{n+1}$. The correct value,† $y_{n+1}$,
satisfies (5.2-5), which we rewrite for convenience as

$$y_{n+1} = \sum_{i=0}^{p} (a_i y_{n-i} + h b_i y'_{n-i}) + h b_{-1}\,y'_{n+1} \tag{5.5-4}$$

Subtracting (5.5-3) from (5.5-4), we get

$$y_{n+1} - y_{n+1}^{(j+1)} = h b_{-1}\,[y'_{n+1} - (y_{n+1}^{(j)})'] \tag{5.5-5}$$

which, using the mean-value theorem, becomes

$$y_{n+1} - y_{n+1}^{(j+1)} = h b_{-1}\,f_y(x_{n+1}, \eta)\,(y_{n+1} - y_{n+1}^{(j)}) \tag{5.5-6}$$


† That is, the true solution of (5.2-5), not the true solution of the differential equation.




where $\eta$ lies between $y_{n+1}$ and $y_{n+1}^{(j)}$. If in a neighborhood of $(x_{n+1}, y_{n+1})$
which includes all points $(x_{n+1}, y_{n+1}^{(j)})$

$$|f_y(x, y)| < K \tag{5.5-7}$$

then†

$$|y_{n+1} - y_{n+1}^{(j+1)}| < h b_{-1} K\,|y_{n+1} - y_{n+1}^{(j)}| \tag{5.5-8}$$

and, by induction,

$$|y_{n+1} - y_{n+1}^{(j+1)}| < (h b_{-1} K)^{j+1}\,|y_{n+1} - y_{n+1}^{(0)}| \tag{5.5-9}$$

Thus, if

$$h b_{-1} K < 1 \tag{5.5-10}$$

then, as $j \to \infty$, $y_{n+1}^{(j+1)} \to y_{n+1}$ and the iteration converges. Moreover, the
difference $|y_{n+1} - y_{n+1}^{(j)}|$ is monotonically decreasing for all j. In most
cases in which we iterate to convergence, we shall assume that h has been
chosen so that (5.5-10) is satisfied. The magnitude of $h b_{-1} K$ then determines
the rate at which the iteration converges, so that for rapid convergence we
should have $h b_{-1} K \ll 1$ {20}.

For stiff equations (Sec. 5.10), (5.5-10) will not usually hold, so that if we 
must iterate to convergence, we must use some method other than simple 
iteration to solve the nonlinear equations (see Sec. 8.8). 
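The convergence condition (5.5-10) can be observed numerically. The following is a minimal sketch, not taken from the text, that applies simple iteration to the corrector (5.5-1), for which $b_{-1} = 1/2$, on the test equation $y' = -Ky$. The function names, the value K = 10, and the two step sizes are illustrative assumptions.

```python
def corrector_iterates(f, x_n, y_n, h, y_guess, n_iter):
    """Simple iteration of the corrector (5.5-1): y <- y_n + (h/2) [f(x_{n+1}, y) + f(x_n, y_n)]."""
    f_n = f(x_n, y_n)
    ys = [y_guess]
    for _ in range(n_iter):
        ys.append(y_n + 0.5 * h * (f(x_n + h, ys[-1]) + f_n))
    return ys


K = 10.0  # assumed bound on |df/dy| for the test equation y' = -K y


def f(x, y):
    return -K * y


def iteration_errors(h, n_iter=8):
    """Distance of each iterate from the exact solution of the corrector equation."""
    y_n = 1.0
    # For this linear f the fixed point of (5.5-1) is known in closed form.
    y_fixed = y_n * (1.0 - 0.5 * h * K) / (1.0 + 0.5 * h * K)
    ys = corrector_iterates(f, 0.0, y_n, h, y_n, n_iter)
    return [abs(y - y_fixed) for y in ys]


converging = iteration_errors(0.1)  # h * b_{-1} * K = 0.5 < 1
diverging = iteration_errors(0.3)   # h * b_{-1} * K = 1.5 > 1
print(converging[-1], diverging[-1])
```

For this linear problem each iteration multiplies the error by exactly $h b_{-1} K$, so the first run halves the error at every pass while the second grows without bound.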


5.5-2 Predictors and Correctors


Suppose we are going to use an iterative method to solve the differential 
equation (5.1-1). The only term in (5.5-3) that changes from one iteration to 
the next is the last term on the right-hand side. Thus the major calculation in 
each iteration is the value of $(y_{n+1}^{(j)})' = f(x_{n+1}, y_{n+1}^{(j)})$. In practice, the evaluation
of f(x, y) will generally be much more time-consuming than evaluation
of the whole right-hand side of (5.5-3). Thus, in comparing the computa- 
tional efficiency of various methods, we shall be interested particularly in 
how many evaluations of f(x, y) are necessary at each step. 

In many cases, we shall perform the iteration of the previous section 
until two successive iterates differ by less than some tolerance. We shall then 
accept the final iterate as y,,,. The number of iterations required [each of 
which requires one evaluation of f(x, y)] will depend upon 


1. The accuracy of the initial guess or estimation. Clearly, the nearer $y_{n+1}^{(0)}$ is
   to $y_{n+1}$, the faster the iteration will converge.

2. The accuracy desired in the final value of y,, ,. For example, if the error 
E,, at each step is of the order 107 °, there is no reason to require y,,, to 
be correct to 10 decimals. This question is considered in more detail in 
Sec. 5.7-2. 

3. The value of $h b_{-1} K$.


† In what follows, we assume that $b_{-1}$ is positive, as in all practical cases it is.

Our first object here is to consider ways of predicting $y_{n+1}^{(0)}$ as accurately
as possible consistent with other properties that are desirable in such predictors.
Using the predicted value, we shall then use an iterative formula to
correct the prediction, hence the name predictor-corrector methods.

The best way to predict $y_{n+1}^{(0)}$ is to use a forward-integration formula
since such a formula expresses $y_{n+1}$ in terms of known past values of y and
y'. Thus, for example, (5.5-2) could be used as a predictor for (5.5-1). The
predictor-corrector system would then be

Predictor: $y_{n+1}^{(0)} = y_{n-2} + \frac{3h}{2}\,(y'_n + y'_{n-1})$

           $(y_{n+1}^{(0)})' = f(x_{n+1}, y_{n+1}^{(0)})$    (5.5-11)

Corrector: $y_{n+1}^{(j+1)} = y_n + \frac{h}{2}\,[(y_{n+1}^{(j)})' + y'_n]$,  j = 0, 1, ...


where the corrector is iterated until the desired degree of convergence is
achieved. We noted previously that the truncation error of both predictor
and corrector involves a third derivative. As we shall see in Sec. 5.5-3, it is
important that both predictor and corrector have error terms with derivatives
of the same order.

Equation (5.5-11) is a second-order predictor-corrector system. A 
fourth-order system using corresponding open and closed Newton-Cotes 
integration formulas is 


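Predictor: $y_{n+1}^{(0)} = y_{n-3} + \frac{4h}{3}\,(2y'_n - y'_{n-1} + 2y'_{n-2})$

           $(y_{n+1}^{(0)})' = f(x_{n+1}, y_{n+1}^{(0)})$    (5.5-12)

Corrector: $y_{n+1}^{(j+1)} = y_{n-1} + \frac{h}{3}\,[(y_{n+1}^{(j)})' + 4y'_n + y'_{n-1}]$,  j = 0, 1, ...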


where the truncation-error terms of predictor and corrector are, respectively,
$\frac{14}{45} h^5 Y^{\mathrm{v}}(\eta_1)$ and $-\frac{1}{90} h^5 Y^{\mathrm{v}}(\eta_2)$. The corrector is, of course, just Simpson's rule.
This predictor-corrector method is known as Milne's method. In Example 5.5
we showed that the corrector in (5.5-12) is not stable for any positive value of
hK, that is, $\partial f/\partial y$ negative. For this reason, the use of (5.5-12) is not advisable
unless the number of steps of the integration to be carried out is too
small to allow the parasitic solution to achieve a substantial magnitude or
unless it is known that $\partial f/\partial y$ is positive. In Sec. 5.5-4, we shall consider a
modification of (5.5-12) which is stable for sufficiently small values of hK.

In Fig. 5.1, we have indicated the sequence of steps required to use either 
(5.5-11) or (5.5-12) in proceeding from x, to x,41. The use of convergence 
tolerances and other computational matters alluded to in Fig. 5.1 will be dis- 
cussed in some detail in Sec. 5.7-2. 
[Figure 5.1 The use of predictor-corrector methods. Flowchart; surviving labels: test against the convergence tolerance; then let $y_{n+1} = y_{n+1}^{(j+1)}$ and compute $y'_{n+1}$ for use in the next stage; proceed to the next integration step.]


We are, of course, not restricted to using a Newton-Cotes open formula 
as predictor or a closed formula as corrector. Another choice for predictors 
is the Adams-Bashforth formulas (5.2-13). For example, we have 


p = 0:  $y_{n+1} = y_n + h\,y'_n$    (5.5-13)

p = 1:  $y_{n+1} = y_n + \frac{h}{2}\,(3y'_n - y'_{n-1})$    (5.5-14)

p = 2:  $y_{n+1} = y_n + \frac{h}{12}\,(23y'_n - 16y'_{n-1} + 5y'_{n-2})$    (5.5-15)

The case p = 0 is called Euler's method or the point-slope method. As we have
seen, Adams-Bashforth predictors can all be generated by integrating 
Newton’s backward-interpolation formula, and thus the truncation-error 
terms in (5.5-13) to (5.5-15) are easily found {21}. A similar situation holds 
with correctors given by the Adams-Moulton formulas (5.2-14). 

If the point-slope method is used to integrate (5.4-1), the error for the 
case K <0 is indicated graphically in Fig. 5.2. We see that the calculated 
solution always lags behind the true solution. Clearly, this will be the case 
for any exponentially increasing solution. A formula that attempts to avoid 
this difficulty is the midpoint method 


$$y_{n+1} = y_{n-1} + 2h\,y'_n \tag{5.5-16}$$


where the derivative used is at the midpoint between the two abscissas. The 
midpoint method is a special case of a method known as Nystrom’s 
method {23}. 

Another class of predictors can be derived from the Hermite or modified 
Hermite interpolation formulas by letting x = x,,, and replacing f(x) by 
y(x). A fifth-order example of such a formula derived from (3.7-17) with 
n= 3 is 


$$y_{n+1} = -18y_n + 9y_{n-1} + 10y_{n-2} + h\,(9y'_n + 18y'_{n-1} + 3y'_{n-2}) \tag{5.5-17}$$


Note that the fluctuation in coefficients means that this formula has bad 
roundoff properties. Other formulas of this type are considered in {24}; see 
also P, in Table 5.1. 

On a digital computer it is common to use fourth-order predictor-corrector
methods since methods of this order provide sufficient accuracy
for most problems and are reasonably straightforward to derive and use.
However, there is a tendency now to use higher-order methods at a point
where the function is assumed to be smooth on the basis of the results of the
calculation up to that point. Conversely, low-order methods are used when
the function starts to show singular behavior. Such a variable-order method is
thus seen to be adaptive in the sense of Sec. 4.11. When combined with the
choice of step length based on the best estimate consistent with error considerations,
this leads to the very efficient and accurate variable-order-variable-step
integration methods (see Sec. 5.7-1). Since the implementation
of these methods is quite complicated, it is worth using them only for large
systems of equations or for integration over a large interval. For small
systems, the fourth-order methods studied here are adequate.

[Figure 5.2 The point-slope method. Plot of the computed solution against x at the abscissas $x_0, x_1, x_2, x_3, x_4$.]

In Table 5.1 we have listed a number of fourth-order predictors. The
error coefficient is the coefficient of $h^5 Y^{\mathrm{v}}(\eta)$ in the error term. The choice of a
predictor is not nearly so critical as the choice of a corrector. By far the most
important factor is low truncation error in order to assure as good a prediction
as possible. Other factors of some importance are (1) ease of computation;
e.g., zero coefficients make the evaluation of the predictor easier, and
(2) roundoff properties; note that the extremely bad roundoff properties of
$P_6$ lower its value despite its extremely good truncation error.

The choice of a corrector depends importantly on stability. For this 
reason we defer discussion of the choice of a corrector until Sec. 5.5-4, in 
which corrector stability is considered. 


5.5-3 Error Estimation


At each step in the use of predictor-corrector methods, we get two estimates 
of the solution at $x_{n+1}$, the predicted value and the corrected value. This
enables us to obtain an estimate, called a Milne-type estimate, of the error 
incurred at each step. This estimation, which gives us an idea of the order of 
magnitude of the error, enables us to judge whether to accept the corrected 
value or to reject it. If we reject it, we must repeat the step with a 
smaller value of h. If we accept it, the estimate enables us to determine 
whether to (1) decrease the interval size h if the error is too large or (2) 
increase the interval size and thereby speed up the computation if the error 
is smaller than needed for the accuracy we desire or (3) leave the interval 


Table 5.1 Fourth-order predictors 


Py P, P, P, Ps P. 
(Adams) (Milne) 

Go i 0 0 0 n _9 
ay 0 0 0 1 4 9 
ay 0 0 I 0 4 1 
a, 0 ! 0 0 0 0 
b a cs ee ee ee: 
by — 34 —3 —3 —3 — $3 6 
b, 34 3 e $ 2 0 
D3 7a 0 3 -} -2 0 
Error #31 324 243 232 242 72. 


coefficient 
unchanged. Implicit in the above is an assumption to be used throughout
this section, namely that truncation error is the dominant source of error.

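We shall use the method (5.5-12) to illustrate error estimation. When
(5.3-1) and the definition of $\epsilon_n$ in (5.4-19) are used, the predictor in (5.5-12)
can be written

$$\begin{aligned}
y_{n+1}^{(0)} &= Y_{n-3} + \frac{4h}{3}\,(2Y'_n - Y'_{n-1} + 2Y'_{n-2}) - \epsilon_{n-3} - \frac{4h}{3}\,(2\epsilon'_n - \epsilon'_{n-1} + 2\epsilon'_{n-2}) \\
&= Y_{n+1} - \epsilon_{n-3} - \frac{4h}{3}\,(2\epsilon'_n - \epsilon'_{n-1} + 2\epsilon'_{n-2}) - \frac{14}{45} h^5 Y^{\mathrm{v}}(\eta_1)
\end{aligned} \tag{5.5-18}$$

Similarly, removing the superscripts, we can write the corrector as

$$y_{n+1} = Y_{n+1} - \epsilon_{n-1} - \frac{h}{3}\,(\epsilon'_{n+1} + 4\epsilon'_n + \epsilon'_{n-1}) + \frac{1}{90} h^5 Y^{\mathrm{v}}(\eta_2) \tag{5.5-19}$$

where $y_{n+1}$ is the value that would be obtained if the corrector were iterated
to convergence. Subtracting (5.5-18) from (5.5-19), we get

$$y_{n+1} - y_{n+1}^{(0)} = \frac{1}{90} h^5 Y^{\mathrm{v}}(\eta_2) + \frac{14}{45} h^5 Y^{\mathrm{v}}(\eta_1) - \frac{h}{3}\,(\epsilon'_{n+1} - 4\epsilon'_n + 5\epsilon'_{n-1} - 8\epsilon'_{n-2}) + \epsilon_{n-3} - \epsilon_{n-1} \tag{5.5-20}$$

If we assume (1) that $\epsilon_i$ changes slowly from step to step and (2) that $(h/3)\epsilon'_i$ is
small compared with the truncation error, we may drop the $\epsilon_i$ and $\epsilon'_i$ terms
from (5.5-20). Then, if we further assume that $Y^{\mathrm{v}}(x)$ does not change greatly
between $\eta_1$ and $\eta_2$, we can write

$$y_{n+1} - y_{n+1}^{(0)} \approx \frac{29}{90} h^5 Y^{\mathrm{v}}(\eta) \tag{5.5-21}$$

Thus an estimate of $h^5 Y^{\mathrm{v}}(\eta)$ is given by

$$h^5 Y^{\mathrm{v}}(\eta) \approx \frac{90}{29}\,(y_{n+1} - y_{n+1}^{(0)}) \tag{5.5-22}$$

The truncation error incurred at each step, $T_n$, can then be estimated as

$$T_n = -\frac{1}{90} h^5 Y^{\mathrm{v}}(\eta) \approx -\frac{1}{29}\,(y_{n+1} - y_{n+1}^{(0)}) \tag{5.5-23}$$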


Therefore, we can use the difference in the predicted and corrected values to 
estimate how much error is being made at each step. 
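The estimate (5.5-23) can be checked on a single step. The sketch below is an illustration under stated assumptions rather than the book's own program: it takes the test equation $y' = -y$ with exact past values $e^{-x}$, applies the predictor and the corrector of (5.5-12) with the corrector iterated to convergence, and compares the Milne-type estimate with the error actually made. The function name `milne_step` and the tolerances are hypothetical.

```python
import math


def f(x, y):
    return -y  # test equation y' = -y, solution exp(-x) (an assumption for illustration)


def milne_step(xs, ys, h, tol=1e-15, max_iter=50):
    """One step of (5.5-12). xs, ys hold x_{n-3}..x_n and y_{n-3}..y_n, oldest first.

    Returns (predicted value, corrected value iterated to convergence)."""
    d = [f(x, y) for x, y in zip(xs, ys)]  # y'_{n-3}, y'_{n-2}, y'_{n-1}, y'_n
    x_next = xs[-1] + h
    y_pred = ys[0] + 4.0 * h / 3.0 * (2.0 * d[3] - d[2] + 2.0 * d[1])
    y = y_pred
    for _ in range(max_iter):
        y_new = ys[2] + h / 3.0 * (f(x_next, y) + 4.0 * d[3] + d[2])
        converged = abs(y_new - y) <= tol
        y = y_new
        if converged:
            break
    return y_pred, y


h = 0.1
xs = [-0.3, -0.2, -0.1, 0.0]
ys = [math.exp(-x) for x in xs]          # exact past values
y_pred, y_corr = milne_step(xs, ys, h)

estimate = -(y_corr - y_pred) / 29.0     # right-hand side of (5.5-23)
actual = math.exp(-h) - y_corr           # true solution minus corrected value
print(estimate, actual)
```

With exact past values the two numbers should agree in sign and in order of magnitude, which is all the derivation claims for the estimate.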
The truncation error in the predictor can be estimated as

$$T_n^{(0)} = \frac{14}{45} h^5 Y^{\mathrm{v}}(\eta) \approx \frac{28}{29}\,(y_{n+1} - y_{n+1}^{(0)}) \tag{5.5-24}$$

This not only enables us to estimate how good our predictions are but also
lets us improve the prediction. For, assuming that the difference between the
predicted and corrected values at each step changes slowly, we can estimate
$T_n^{(0)}$ as

$$T_n^{(0)} \approx \frac{28}{29}\,(y_n - y_n^{(0)}) \tag{5.5-25}$$

Therefore,

$$\bar{y}_{n+1}^{(0)} = y_{n+1}^{(0)} + \frac{28}{29}\,(y_n - y_n^{(0)}) \tag{5.5-26}$$

will, in general, be an improved value of the prediction. The complete
predictor-corrector method (5.5-12) then is

Predictor: $y_{n+1}^{(0)} = y_{n-3} + \frac{4h}{3}\,(2y'_n - y'_{n-1} + 2y'_{n-2})$

Modifier:  $\bar{y}_{n+1}^{(0)} = y_{n+1}^{(0)} + \frac{28}{29}\,(y_n - y_n^{(0)})$    (5.5-27)

           $(\bar{y}_{n+1}^{(0)})' = f(x_{n+1}, \bar{y}_{n+1}^{(0)})$

Corrector: $y_{n+1}^{(j+1)} = y_{n-1} + \frac{h}{3}\,[(y_{n+1}^{(j)})' + 4y'_n + y'_{n-1}]$,  j = 0, 1, ...

with $(\bar{y}_{n+1}^{(0)})'$ being used in the corrector initially.†

This procedure that we have illustrated for (5.5-12) can clearly be used 
for any predictor-corrector method as long as the order of predictor and 
corrector are the same. The corresponding case for (5.5-11) is considered in 
{25}. 

A reasonable question to ask at this point is why we do not use the
estimate of the corrector truncation error (5.5-23) to improve the corrected 
value. In fact this can be done, but, as shown in {26}, doing this is equivalent 
to using a system of one higher order (in this case, 5). Moreover, correcting 
the corrector affects the stability properties of the corrector. Therefore, 
rather than correcting the corrector, it is probably a better idea to use a 
higher-order system in the first place. 


5.5-4 Stability


Two basic factors determine the value of a given corrector formula in com- 
parison with others of the same order: (1) the coefficient in the error term 
and (2) its stability properties. That these two properties tend to work 
against each other will probably not surprise the reader. Other factors of 
subsidiary importance are (3) the roundoff properties and (4) the ease with 
which it can be computed (zero coefficients, “ simple ” coefficients, etc.). Our 
aim here is to develop a fourth-order corrector with as desirable properties 
as possible. To do this, we consider a corrector of the form 


$$y_{n+1} = a_0 y_n + a_1 y_{n-1} + a_2 y_{n-2} + h\,(b_{-1} y'_{n+1} + b_0 y'_n + b_1 y'_{n-1}) \tag{5.5-28}$$

which uses data at only the last three points and contains six coefficients, one
more than is necessary to achieve an order of 4. We shall use this extra
degree of freedom to give the corrector desirable stability properties. We
could also include a term $b_2 y'_{n-2}$ in (5.5-28) and thus achieve another degree
of freedom, still using data at only three past points. This other degree of
freedom could, for example, be used to achieve good roundoff properties.
This is considered in {27}.


† At the first predictor-corrector step, i.e., after starting values have been computed (see
Sec. 5.6), there will be no previous value of y to use in the modifier, which therefore should be
omitted at this step.

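The requirement that (5.5-28) be exact for $y(x) = x^j$, j = 0, ..., 4, leads to
the equations {5}

$$a_0 = \tfrac{9}{8}(1 - a_1), \quad a_2 = -\tfrac{1}{8}(1 - a_1), \quad b_{-1} = \tfrac{1}{24}(9 - a_1), \quad b_0 = \tfrac{1}{12}(9 + 7a_1), \quad b_1 = \tfrac{1}{24}(-9 + 17a_1) \tag{5.5-29}$$

where $a_1$ is the free parameter. The influence function in the truncation
error is given by

$$\begin{aligned}
G(s) = {} & (x_{n+1} - s)_+^4 - a_0 (x_n - s)_+^4 - a_1 (x_{n-1} - s)_+^4 - a_2 (x_{n-2} - s)_+^4 \\
& - 4h\,[b_{-1} (x_{n+1} - s)_+^3 + b_0 (x_n - s)_+^3 + b_1 (x_{n-1} - s)_+^3]
\end{aligned} \tag{5.5-30}$$

It is naturally of interest to determine when G(s) is of constant sign in
$[x_{n-2}, x_{n+1}]$ as a function of $a_1$. We leave determination of this to a problem
{10}, but note here that for $a_1$ in [−.6, 1.0] G(s) is indeed of constant sign.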
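The exactness conditions on (5.5-28) form a small linear system that can be solved for any chosen $a_1$. The following sketch does so in exact rational arithmetic, taking $x_n = 0$ and $h = 1$ (a normalization assumed here for convenience), and also evaluates the coefficient of $h^5 Y^{\mathrm{v}}$ in the error term from the residual on $x^5$. The helper names are hypothetical.

```python
from fractions import Fraction


def solve(A, b):
    """Gauss-Jordan elimination in exact rational arithmetic."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


def power(x, j):
    return Fraction(x) ** j  # x**j, with 0**0 == 1


def dpower(x, j):
    return j * Fraction(x) ** (j - 1) if j > 0 else Fraction(0)  # derivative of x**j


def corrector_coefficients(a1):
    """Coefficients of (5.5-28) for a given a1, with x_n = 0 and h = 1."""
    a1 = Fraction(a1)
    # Unknowns ordered (a0, a2, b_{-1}, b0, b1); nodes x_{n+1}=1, x_n=0, x_{n-1}=-1, x_{n-2}=-2.
    A = [[power(0, j), power(-2, j), dpower(1, j), dpower(0, j), dpower(-1, j)]
         for j in range(5)]
    b = [power(1, j) - a1 * power(-1, j) for j in range(5)]
    a0, a2, bm1, b0, b1 = solve(A, b)
    # Residual on x**5 divided by 5! gives the coefficient of h^5 Y^v in the error term.
    residual = (power(1, 5) - a0 * power(0, 5) - a1 * power(-1, 5) - a2 * power(-2, 5)
                - bm1 * dpower(1, 5) - b0 * dpower(0, 5) - b1 * dpower(-1, 5))
    return {"a0": a0, "a1": a1, "a2": a2, "b-1": bm1, "b0": b0, "b1": b1,
            "error": residual / 120}


hamming = corrector_coefficients(0)
milne = corrector_coefficients(1)
print(hamming)
print(milne)
```

Setting $a_1 = 1$ recovers Simpson's rule, the corrector of (5.5-12), and $a_1 = 0$ gives the corrector with coefficients 9/8, −1/8, 3/8, 3/4, −3/8.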

The stability equation for (5.5-28), that is, the equation corresponding to
(5.4-5), is

$$(1 + hK b_{-1})\,r^3 = (a_0 - hK b_0)\,r^2 + (a_1 - hK b_1)\,r + a_2 \tag{5.5-31}$$

To determine the value of hK for which (5.4-28) holds as a function of $a_1$ is
quite difficult for this cubic. We start by considering the case h = 0. As a
function of $a_1$, the three roots are shown in Fig. 5.3. We conclude that only
in the interval $-.6 < a_1 < 1.0$ can there be stability. Because the roots of
(5.5-31) are continuous functions of hK, for any value of $a_1$ interior to
[−.6, 1.0] there will be some range of hK for which the method is stable. In
order to get some insight into the best value of $a_1$ to choose, consider the
data in Table 5.2. The error coefficient is the coefficient of $h^5 Y^{\mathrm{v}}(\eta)$ in the
error term. We see that the error coefficient steadily decreases as $a_1$ goes from 1.0 to
−.6 [note that $a_1 = 1$ is Milne's corrector in (5.5-12)]. Also the roundoff
properties, judged by the sum of the squares of $a_0$, $a_1$, and $a_2$, become
steadily worse in this direction from $\frac{9}{17}$ on down. A reasonable choice would
seem to be $a_1 = 0$ since it is near the center of the interval [and thus is likely
to be stable for a greater range of hK than values near the ends of the
interval (why?)], has one zero coefficient, a reasonable error-term coefficient,
and reasonable roundoff properties. In Fig. 5.4 we have plotted the roots of
(5.5-31) for $a_1 = 0$. For hK < +.69 and all negative hK of importance, the
stability condition (5.4-28) is always satisfied.

[Figure 5.3 Roots of (5.5-31) with h = 0.]

[Figure 5.4 Roots of (5.5-31) for $a_1 = 0$.]

When $a_1 = 0$, the condition (5.5-10) for convergence of the iterations is
that |hK| be less than $\frac{8}{3}$ since $b_{-1} = \frac{3}{8}$. In fact, in order to get rapid convergence
of the corrector, we would want $|hK| \ll \frac{8}{3}$. Thus the requirement
hK < +.69 is not restrictive in practice.
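The behavior of the roots at h = 0 can be checked directly. In the sketch below, which is an illustration and not the book's computation, the cubic (5.5-31) with h = 0 is factored as $(r - 1)(r^2 + cr + d)$ using $a_0 + a_1 + a_2 = 1$, and the two extraneous roots are found from the quadratic. It assumes the relations $a_0 = \frac{9}{8}(1 - a_1)$ and $a_2 = -\frac{1}{8}(1 - a_1)$ that follow from the exactness conditions on (5.5-28); the function name is hypothetical.

```python
import cmath


def extraneous_roots_h0(a1):
    """Roots of (5.5-31) at h = 0 other than the principal root r = 1."""
    a0 = 9.0 / 8.0 * (1.0 - a1)   # assumed relation from the exactness conditions
    a2 = -(1.0 - a1) / 8.0        # assumed relation from the exactness conditions
    # r^3 - a0 r^2 - a1 r - a2 = (r - 1)(r^2 + c r + d), valid because a0 + a1 + a2 = 1
    c = 1.0 - a0
    d = a2
    disc = cmath.sqrt(c * c - 4.0 * d)
    return (-c + disc) / 2.0, (-c - disc) / 2.0


def largest_extraneous_root(a1):
    return max(abs(r) for r in extraneous_roots_h0(a1))


for a1 in (1.2, 0.5, 0.0, -0.5, -0.8):
    print(a1, largest_extraneous_root(a1))
```

Values of $a_1$ inside (−.6, 1.0) give extraneous roots of magnitude less than 1, while the two values outside the interval give a root outside the unit circle.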

Our conclusion then is that the corrector (5.5-28) with $a_1 = 0$

$$y_{n+1} = \tfrac{1}{8}\,(9y_n - y_{n-2}) + \tfrac{3h}{8}\,(y'_{n+1} + 2y'_n - y'_{n-1}) \tag{5.5-32}$$

is a desirable fourth-order corrector to use in place of Milne's corrector in
(5.5-12). Using this corrector and Milne's predictor, we obtain Hamming's
method {29}

Predictor: $y_{n+1}^{(0)} = y_{n-3} + \frac{4h}{3}\,(2y'_n - y'_{n-1} + 2y'_{n-2})$

Modifier:  $\bar{y}_{n+1}^{(0)} = y_{n+1}^{(0)} + \frac{112}{121}\,(y_n - y_n^{(0)})$    (5.5-33)

           $(\bar{y}_{n+1}^{(0)})' = f(x_{n+1}, \bar{y}_{n+1}^{(0)})$

Corrector: $y_{n+1}^{(j+1)} = \frac{1}{8}\,(9y_n - y_{n-2}) + \frac{3h}{8}\,[(y_{n+1}^{(j)})' + 2y'_n - y'_{n-1}]$,  j = 0, 1, ...

Using Table 5.2 we get, analogous to (5.5-23),

$$T_n = -\frac{1}{40} h^5 Y^{\mathrm{v}}(\eta) \approx -\frac{9}{121}\,(y_{n+1} - y_{n+1}^{(0)}) \tag{5.5-34}$$

In Sec. 5.7-2 we shall give some numerical examples comparing this method
with (5.5-27). An analysis similar to that in this section is possible for
methods of any order {29}.
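A minimal sketch of Hamming's method (5.5-33) follows, with the corrector iterated by simple iteration until two successive iterates agree to a tolerance. It is an illustration under assumptions: the four starting values are taken from the known solution of the test problem $y' = -y$, y(0) = 1 (generating them in general is the subject of Sec. 5.6), and the function name, tolerance, and step count are hypothetical.

```python
import math


def hamming(f, x0, y_start, h, n_steps, tol=1e-13, max_iter=25):
    """Integrate y' = f(x, y) by (5.5-33).

    y_start holds y_0..y_3 at x0, x0+h, x0+2h, x0+3h. Returns all computed y values."""
    ys = list(y_start)
    ds = [f(x0 + i * h, y) for i, y in enumerate(ys)]  # y'_0 .. y'_3
    pc_diff = 0.0  # y_n - y_n^(0); zero at the first step, so the modifier is omitted there
    for n in range(3, 3 + n_steps):
        x_next = x0 + (n + 1) * h
        y_pred = ys[n - 3] + 4.0 * h / 3.0 * (2.0 * ds[n] - ds[n - 1] + 2.0 * ds[n - 2])
        y = y_pred + 112.0 / 121.0 * pc_diff           # modifier
        for _ in range(max_iter):                      # corrector, simple iteration
            y_new = (9.0 * ys[n] - ys[n - 2]) / 8.0 + 3.0 * h / 8.0 * (
                f(x_next, y) + 2.0 * ds[n] - ds[n - 1])
            converged = abs(y_new - y) <= tol
            y = y_new
            if converged:
                break
        pc_diff = y - y_pred
        ys.append(y)
        ds.append(f(x_next, y))
    return ys


h = 0.1
start = [math.exp(-i * h) for i in range(4)]           # exact starting values (assumption)
ys = hamming(lambda x, y: -y, 0.0, start, h, n_steps=17)
error_at_2 = abs(ys[-1] - math.exp(-2.0))
print(error_at_2)
```

Here hK = 0.1, well inside the range hK < .69 quoted above, so the computed values should track $e^{-x}$ closely out to x = 2.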


Table 5.2 Correctors for sample values of a_1

              (Milne)                      (Hamming)
a_1            1        9/17     1/9       0        -1/7     -9/31    -3/5

a_0            0        9/17     1         9/8      9/7      45/31    9/5
a_1            1        9/17     1/9       0        -1/7     -9/31    -3/5
a_2            0        -1/17    -1/9      -1/8     -1/7     -5/31    -1/5
b_{-1}         1/3      6/17     10/27     3/8      8/21     12/31    2/5
b_0            4/3      18/17    22/27     3/4      2/3      18/31    2/5
b_1            1/3      0        -8/27     -3/8     -10/21   -18/31   -4/5
Error          -1/90    -3/170   -19/810   -1/40    -17/630  -9/310   -1/30
coefficient
Another useful fourth-order corrector, though not of the form (5.5-28), is
the Adams-Moulton corrector

$$y_{n+1} = y_n + \frac{h}{720}\,(251y'_{n+1} + 646y'_n - 264y'_{n-1} + 106y'_{n-2} - 19y'_{n-3})$$


with error coefficient — 35. The stability characteristics of this formula are 
much better than those of (5.5-32) in that the interval of absolute stability is
(0, 3.0) compared with (0, .69) for formula (5.5-32). In general, the Adams- 
Moulton formulas (5.2-14) have good stability characteristics. One reason is 
that for h = 0, all extraneous roots vanish. In addition, the value of y,,, 
is directly coupled to the previous value y, by an integral, and integration 
is a smoothing operation. 


5.6 STARTING THE SOLUTION AND 
CHANGING THE INTERVAL 


We come now to two problems we have thus far put off: 


1. How do we obtain the starting values (initial conditions) for (5.2-5) that 
are required besides the initial condition of the differential equation? 

2. How do we change the interval h during the computation if our error 
estimate indicates that this is desirable? 


There are, of course, some self-starting methods—Eq. (5.5-13) is an 
example—which require no starting values other than that provided by the 
differential equation. While these methods are of low order and therefore not 
sufficiently accurate for most problems, they are very useful in variable- 
order-variable-step methods (Sec. 5.7-1). In this section, however, we shall 
be concerned with starting values and interval changes for higher-order 
methods which are to be used over the whole range of integration. 


5.6-1 Analytic Methods 


Taylor series The Taylor-series expansion of y(x) about $x_0$ can be written

$$y(x_0 + hs) = y_0 + hs\,y'_0 + \frac{h^2 s^2}{2}\,y''_0 + \cdots \tag{5.6-1}$$

Using the given initial condition $y_0$, we can, using (5.1-1), calculate $y'_0$. Then
by differentiating the differential equation, higher derivatives of y at $x_0$ can
be calculated. Thus (5.6-1) can be used to approximate $y(x_0 + hs)$ for any s
for which the series converges.

Example 5.6 Use (5.6-1) to calculate initial values at $x_1$, $x_2$, and $x_3$ for the differential
equation $y' = y^2$, y(0) = 1 with h = .01.
We calculate as follows:

$y'(0) = 1$,   $y'' = 2yy'$,   $y''(0) = 2$
$y''' = 2(y')^2 + 2yy''$,   $y'''(0) = 2 + 4 = 6$,   etc.

Then $y(hs) = 1 + .01s + 10^{-4}s^2 + 10^{-6}s^3 + \cdots$

which can be used with s = 1, 2, 3, to approximate $y_1$, $y_2$, $y_3$ {35}.


If the Taylor series converges for the requisite values of x, this process
can be used to get initial values of any desired accuracy. Customarily, one
would desire the accuracy of the initial values to be at least as great as that of
the numerical integration procedure to be used. An alternative approach to
the above is considered in {34}.
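For the equation of Example 5.6 the Taylor coefficients need not be found by repeated differentiation by hand. Writing $y = \sum c_k (x - x_0)^k$ and substituting into $y' = y^2$ gives the recurrence $(k+1)\,c_{k+1} = \sum_{i=0}^{k} c_i c_{k-i}$. The sketch below uses that recurrence, which is specific to this right-hand side and is an illustration rather than the general method of the text; the function names and the choice of eight terms are assumptions.

```python
def taylor_coefficients(y0, n_terms):
    """Taylor coefficients c_0..c_{n_terms-1} of the solution of y' = y**2, y(x0) = y0."""
    c = [y0]
    for k in range(n_terms - 1):
        cauchy = sum(c[i] * c[k - i] for i in range(k + 1))  # coefficient of (x-x0)^k in y^2
        c.append(cauchy / (k + 1))
    return c


def taylor_value(c, dx):
    """Evaluate sum c_k dx**k by Horner's rule."""
    total = 0.0
    for coeff in reversed(c):
        total = total * dx + coeff
    return total


h = 0.01
coeffs = taylor_coefficients(1.0, 8)
starts = [taylor_value(coeffs, s * h) for s in (1, 2, 3)]  # approximations to y_1, y_2, y_3
exact = [1.0 / (1.0 - s * h) for s in (1, 2, 3)]           # true solution y = 1/(1 - x)
print(coeffs)
print(starts)
```

All coefficients come out equal to 1, in agreement with the series $1 + .01s + 10^{-4}s^2 + 10^{-6}s^3 + \cdots$ of the example.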


The method of successive approximations (Picard's method) Equation (5.1-1)
can be written

$$y(x) = y_0 + \int_{x_0}^{x} f(x, y)\,dx \tag{5.6-2}$$

Assuming an initial approximation $y(x) = y_0(x)$ and inserting it on the right-hand
side of (5.6-2), we generate an approximation $y_1(x)$. This process can
be iterated, and if f(x, y) satisfies the conditions of page 164 [see Ince
(1926)], the process will converge in a neighborhood of $x_0$. Thus we can
obtain approximations to Y(x) at the desired values of x. This technique is of 
great importance in the theory of the existence of solutions of (5.1-1), but the 
difficulty of evaluating the integral in (5.6-2) makes it impractical for numer- 
ical computations. 

While both these methods involve analytic operations, the Taylor-series 
method involves only analytic differentiation and can be mechanized quite 
readily on a digital computer. In fact, the Taylor-series method has been 
proposed as a general-purpose numerical integration method, and programs 
exist for solving systems of differential equations by Taylor series using 
analytic continuation methods. However, if analytic differentiation capabili- 
ties are available, the Hermite methods discussed in Sec. 5.9-1 must also be 
considered. Picard’s method, on the other hand, involves indefinite integra- 
tion. While programs have been written to mechanize this process, they do 
not always work, even when the integral can be expressed in terms of 
elementary functions. Moreover, there exist many functions expressed in 
terms of elementary functions for which the indefinite integral cannot be so 
expressed. A simple example of this is the function $e^{x^2}$. Hence the iteration
process may break down very quickly. 
5.6-2 A Numerical Method


Suppose, as in Example 5.6, that three values of y besides $y_0$ are needed. Let
us approximate y'(x) by a Lagrangian interpolation formula using the four
equally spaced abscissas $x_0$, $x_1$, $x_2$, and $x_3$. Then we insert the result of this
on the right-hand side of (5.6-2) and integrate from $x_0$ to, respectively, $x_1$,
$x_2$, and $x_3$. So doing, we get {36}

$$\begin{aligned}
y_1 &= y_0 + \frac{h}{24}\,(9y'_0 + 19y'_1 - 5y'_2 + y'_3) \\
y_2 &= y_0 + \frac{h}{3}\,(y'_0 + 4y'_1 + y'_2) \\
y_3 &= y_0 + \frac{h}{8}\,(3y'_0 + 9y'_1 + 9y'_2 + 3y'_3)
\end{aligned} \tag{5.6-3}$$

where the error term in all three equations can be shown to be $O(h^5)$ {36}. To
use (5.6-3), we make an initial estimate of $y_1$, $y_2$, and $y_3$, use (5.1-1) to
calculate $y'_1$, $y'_2$, and $y'_3$, and then use (5.6-3) to get new values of $y_1$, $y_2$, and
$y_3$. This process can be iterated and if it converges it will give the four
required starting values. Note, for example, that the method (5.5-33) requires
precisely four starting values.
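The iteration on (5.6-3) can be sketched as follows. This is an illustrative implementation under assumptions, not the book's program: the initial estimates of $y_1$, $y_2$, $y_3$ are simply set equal to $y_0$, the stopping tolerance and iteration limit are arbitrary, and the test problem is that of Example 5.6.

```python
def starting_values(f, x0, y0, h, tol=1e-14, max_iter=50):
    """Iterate (5.6-3) to obtain y_1, y_2, y_3 from y_0."""
    x1, x2, x3 = x0 + h, x0 + 2.0 * h, x0 + 3.0 * h
    d0 = f(x0, y0)
    y1 = y2 = y3 = y0  # crude initial estimates (an assumption of this sketch)
    for _ in range(max_iter):
        d1, d2, d3 = f(x1, y1), f(x2, y2), f(x3, y3)
        n1 = y0 + h / 24.0 * (9.0 * d0 + 19.0 * d1 - 5.0 * d2 + d3)
        n2 = y0 + h / 3.0 * (d0 + 4.0 * d1 + d2)
        n3 = y0 + h / 8.0 * (3.0 * d0 + 9.0 * d1 + 9.0 * d2 + 3.0 * d3)
        change = max(abs(n1 - y1), abs(n2 - y2), abs(n3 - y3))
        y1, y2, y3 = n1, n2, n3
        if change <= tol:
            break
    return y1, y2, y3


h = 0.01
values = starting_values(lambda x, y: y * y, 0.0, 1.0, h)   # y' = y^2, y(0) = 1
exact = [1.0 / (1.0 - s * h) for s in (1, 2, 3)]            # true solution y = 1/(1 - x)
max_error = max(abs(v - e) for v, e in zip(values, exact))
print(values, max_error)
```

With h = .01 the iteration settles in a few passes, and the remaining error is the $O(h^5)$ truncation error of the formulas themselves.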

Two disadvantages of this method are its possible lack of convergence 
and, in any case, the tedious computation involved. It can, of course, be 
mechanized but has no advantage in this respect over the Runge-Kutta 
methods to be discussed in Sec. 5.8. As we shall see, these methods are not 
only the most desirable methods to use in generating starting values, but 
they have their proponents as a general method for numerical solution of 
differential equations competitive with predictor-corrector methods. 


5.6-3 Changing the Interval


Usually the solution of (5.1-1) is desired for some final value of x, say x, and 
at this value of x it is desired that the value of y should be in error by no 
more than some predetermined tolerance. Normally the initial value of h will 
be chosen so that if the a priori estimate of the per step error is correct, the
solution will have the desired accuracy. However, as the computation 
proceeds, the error-estimation procedure of Sec. 5.5-3 may indicate that (1) 
the per step error is larger at each step than is allowable if the final value of y 
is to have the desired accuracy or (2) the error is significantly smaller than is 
necessary. Assuming, as we shall here, that truncation error is the dominant
factor, the indicated action in the first case above is to decrease the step size 
h since truncation error depends on a power of h. Conversely, in the second 
case, we increase the step size so that the computation will proceed more 
rapidly. Since the predictor-corrector methods we have been considering 
require a constant step size, whenever the interval is changed, more starting
values must be generated. For example, suppose the second-order method
(5.5-11) is being used and after $y_n$ has been computed, it is desired to halve
the value of h. Then in calculating $y(x_n + \frac{1}{2}h)$ the predictor requires values of
y at $x_n$, $x_n - \frac{1}{2}h$, and $x_{n-1}$. Thus, in order to proceed with the computation,
we need the value of y at $x_n - \frac{1}{2}h$. An effective method for doing this would
be to use a single-step method such as a second-order Runge-Kutta method
(see Sec. 5.8-2) [because (5.5-11) is a second-order method] starting from
$x_{n-1}$ with a step size h/2. An alternative procedure would be to use such a
method starting from $x_n$ to generate values of y at $x_n + \frac{1}{2}h$ and $x_n + h$ and
then switch back to the use of the predictor-corrector method. The use of a 
single-step method such as a Runge-Kutta method to change the interval h is 
particularly simple on a digital computer when the method has been used to 
calculate starting values and is therefore part of the computer program. 
These methods are, in fact, easily used when the interval h is to be increased 
or decreased by any factor whatsoever. 

Another approach to changing the interval is to use one of the interpolation
formulas of Chap. 3 and previously computed values of y(x) and y'(x) to
interpolate or extrapolate to get those values of y [such as $y(x_n - \frac{1}{2}h)$] which
are required to continue the computation at the new interval. Since y'(x) = 
f (x, y(x)) is also available at the grid points, the Hermite interpolation 
formulas (Sec. 3.7) are most appropriate. A particular example of this is 
considered in {37}. With such interpolation or extrapolation procedures, it is 
convenient to halve the interval when decreasing it and to double it when 
increasing it. But when the interval is to be doubled, it is possible to avoid 
the use of Runge-Kutta methods or interpolation methods entirely {37}. In 
Sec. 5.7-2, we shall give an example of changing the interval in practice. 


5.7 USING PREDICTOR-CORRECTOR METHODS 


One computational problem that we have deferred thus far is the determina- 
tion of when to stop iterating the corrector. Let us assume that because of 
the final accuracy that we desire in our solution we have determined a 
bound on the per step error or the per step relative error which if not exceeded 
during the computation will enable us to achieve the desired accuracy. It is 
reasonable then to require that any error made by not iterating the corrector 
to convergence be small compared with the allowable error, or to put it 
another way, this error should be small compared with truncation and 
roundoff errors. 

Let y be the value that would be obtained if the corrector were iterated
to convergence, and let $y^{(i)}$ be the value of the ith iterate (with $y^{(0)}$ the
predicted value) where, for convenience, we have dropped the subscript n. As
in Sec. 5.5-1 we assume that $b_{-1}$ is positive. Then, since only the $b_{-1}$ term in
the numerical integration method (5.2-5) changes from one iteration to the
next,

$$y^{(i+1)} - y^{(i)} = h b_{-1}\,[(y^{(i)})' - (y^{(i-1)})'] \tag{5.7-1}$$

Let $|(y^{(i)})' - (y^{(i-1)})'| = \delta_i$. Then

$$|y^{(i+1)} - y^{(i)}| = h b_{-1} \delta_i \tag{5.7-2}$$

Using the mean-value theorem,

$$\delta_{i+1} = |(y^{(i+1)})' - (y^{(i)})'| = |f(x, y^{(i+1)}) - f(x, y^{(i)})| \le |y^{(i+1)} - y^{(i)}|\,K = h b_{-1} \delta_i K \tag{5.7-3}$$

if $|\partial f/\partial y| \le K$ in the region of interest. Then from (5.7-2)

$$|y^{(i+2)} - y^{(i+1)}| \le (h b_{-1})^2 \delta_i K \tag{5.7-4}$$

and in this way the differences of successive iterates can be bounded. Now

$$|y - y^{(i+1)}| \le |y^{(i+2)} - y^{(i+1)}| + |y^{(i+3)} - y^{(i+2)}| + \cdots$$

and using the above results, we get

$$|y - y^{(i+1)}| \le h^2 b_{-1}^2 \delta_i K\,(1 + h b_{-1} K + h^2 b_{-1}^2 K^2 + \cdots) = \frac{h^2 b_{-1}^2 \delta_i K}{1 - h b_{-1} K} \tag{5.7-5}$$

if $|h b_{-1} K| < 1$ which, in fact, it must be for convergence of the corrector
iterations. Indeed, in order to get rapid convergence of the corrector, we
have noted that we should have $|h b_{-1} K| \ll 1$. Thus we have the result that
$h^2 b_{-1}^2 \delta_i K$ should be a good approximation to the right-hand side of (5.7-5).
In fact, since we have considered only the maximum values of quantities in
this derivation, we expect that $h^2 b_{-1}^2 \delta_i K$ will be a quite conservative bound
on the error incurred by stopping the iteration after the computation of $(y^{(i)})'$
and $y^{(i+1)}$. This suggests the following procedure. After each corrector iteration,
compute $\delta_i$ and compare it with a convergence factor chosen so that if $\delta_i$
is less than the convergence factor, terminating the iteration with $(y^{(i)})'$ and
$y^{(i+1)}$ will result in a value of $h^2 b_{-1}^2 \delta_i K$ which is small compared with the
allowable per step error. We would compare $\delta_i$ with the product of the
convergence factor and $y^{(i)}$ if we were interested in controlling the relative
error. In either case the test requires some estimate of K, the bound on
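$|\partial f/\partial y|$. Since

$$\frac{|(y^{(i)})' - (y^{(i-1)})'|}{|y^{(i)} - y^{(i-1)}|} = \frac{|f(x, y^{(i)}) - f(x, y^{(i-1)})|}{|y^{(i)} - y^{(i-1)}|} \approx \left|\frac{\partial f}{\partial y}\right| \tag{5.7-6}$$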
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and since, as we have noted, h*b2 , 6, K is a conservative bound on the error 
incurred by terminating the iteration, equation (5.7-6) can be used as the 
estimate of K. Note that this estimate is appropriate only for a single differ- 
ential equation. For a system of equations the situation 1s much more 
complicated. 
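
The stopping test just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the corrector is taken to have the form $y^{(i+1)} = c + h b_{-1} f(x, y^{(i)})$, where `c` collects every term that does not change between iterations, and the names `iterate_corrector`, `conv_factor`, and `k_est`, as well as the trapezoidal-rule usage at the end, are hypothetical and do not come from the text.

```python
def iterate_corrector(f, x, y_pred, c, h, b_m1, conv_factor,
                      relative=False, max_iter=50):
    """Iterate y^(i+1) = c + h*b_m1*f(x, y^(i)) until delta_i < convergence factor.

    Returns (y^(i+1), i, bound), where bound = h^2 * b_m1^2 * delta_i * K and K
    is estimated by the difference quotient of (5.7-6). Single equation only.
    """
    y_prev = y_pred                       # y^(0), the predicted value
    dy_prev = f(x, y_prev)                # (y^(0))'
    y_curr = c + h * b_m1 * dy_prev       # y^(1)
    for i in range(1, max_iter + 1):
        dy_curr = f(x, y_curr)            # (y^(i))'
        delta = abs(dy_curr - dy_prev)    # delta_i
        y_next = c + h * b_m1 * dy_curr   # y^(i+1)
        step = abs(y_curr - y_prev)
        k_est = delta / step if step > 0 else 0.0   # estimate of K
        bound = (h * b_m1) ** 2 * delta * k_est
        tol = conv_factor * abs(y_curr) if relative else conv_factor
        if delta < tol:
            return y_next, i, bound
        y_prev, dy_prev, y_curr = y_curr, dy_curr, y_next
    raise RuntimeError("corrector iteration did not converge")


# Usage (assumed example): trapezoidal corrector, b_{-1} = 1/2, for y' = -y,
# with an Euler predictor, stepping from x = 0 to x = h.
f = lambda x, y: -y
h, y_n = 0.1, 1.0
c = y_n + 0.5 * h * f(0.0, y_n)           # part of the corrector that stays fixed
y_new, n_iter, bound = iterate_corrector(f, h, y_n + h * f(0.0, y_n),
                                         c, h, 0.5, 1e-6)
```

In this usage the returned `bound` can be compared with the allowable per step error; dividing it by $1 - h b_{-1} K$ gives the right-hand side of (5.7-5).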

Generally in the numerical solution of differential equations, the test 
described above will be satisfied after the first application of the corrector, 
that is, i = 1, because the modified predicted value itself will usually be quite 
accurate. Thus generally only two evaluations of f(x, y) are required at each 
step, one after computing the modified predicted value and one after com- 
puting the first corrected value. This is why fourth-order predictor-corrector 
methods are generally substantially faster than fourth-order Runge-Kutta 
methods, which require four evaluations of f(x, y) per step (see Sec. 5.8). 

There is another way of using predictor-corrector pairs, namely to deter- 
mine beforehand a positive integer m, usually taken to be 1, and to iterate the 
corrector exactly m times. This leads to an integration method which has 
different properties from that method in which the corrector is iterated to 
convergence. In the latter case, the properties of the method are independent 
of which predictor is used, while in the former case, the choice of predictor 
has a considerable influence on the properties of the method, especially its 
stability properties. Thus, if we correct only once, the stability equation 
takes the form 


$$(1 + hKb_{-1}) r^{p+1} = \sum_{i=0}^{p} (a_i - hKb_i) r^{p-i} - hKb_{-1} \sum_{i=0}^{p} (a_i^* - hKb_i^*) r^{p-i} \qquad (5.7-7)$$

where the predictor has the form

$$y_{n+1} = \sum_{i=0}^{p} a_i^* y_{n-i} + h \sum_{i=0}^{p} b_i^* y'_{n-i} \qquad (5.7-8)$$

and the corrector is given by (5.2-5) with $b_{-1} > 0$. In the usual fourth-order
case, the Adams-Bashforth-Moulton pair has an absolute stability interval
of (0, 1.3) while the Adams-Moulton formula corrected to convergence has
an interval of (0, 3.0).

A curious situation can arise in using this mode of operation, as is il- 
lustrated by the Milne predictor-corrector pair (5.5-12). If the corrector is 
applied only once, the absolute stability interval becomes (.3, .8). This means 
that by reducing h, one can enter a region of instability! In this mode, we can 
also modify the predictor but at the cost of changing the stability equation 
so that it is almost impossible to analyze it. Similarly, we can use a Milne- 
type estimate to monitor the calculation so as to determine when to change 
the step size. The estimate is, of course, the difference between the predictor 
and the mth corrector. 
5.7-1 Variable-Order—Variable-Step Methods 


In these methods, the computation is started with a low-order rule and a 
small step size so that they are self-starting. As the computation proceeds, 
both the order and the step size are modified as the circumstances demand 
so that at each point the optimal order and step size are chosen, subject to 
certain constraints. The goal is to reach the endpoint b of the interval of 
interest [a, b] with a solution satisfying a given accuracy criterion using the 
least amount of computation. Of course, stability considerations must not be 
neglected in the course of the computation, but the stability theory in these 
methods is not yet fully understood and decisions are based on experimental 
evidence. One decision which is usually made is to use Adams-type methods 
since they have the best stability properties among the constant-order- 
constant-step methods. 

There is a variety of variable-order—variable-step methods, each employ- 
ing some of the above techniques. Thus, in some methods the corrector is 
iterated to convergence, while in others it is applied only once. In some 
methods, Milne-type estimates are used for the local truncation error while 
in others what corresponds to the first neglected difference is used. In some 
methods constraints are imposed on the changing of the order and in others 
on the changing of the step size. This is necessary since experimental 
evidence indicates that changing both the order and the step size too 
frequently introduces instabilities into the computation in a way which is 
not too well understood theoretically. We present here one of several 
strategies for step size and order adjustment. Other approaches can be found 
in the references cited in the Bibliography. 

Let us assume now that we are integrating from the point $x_n$ using the
step size $h_n$ and a method of order q. We thus compute a value $y_{n+1}$ at the
point $x_{n+1} = x_n + h_n$ and an estimate $T_q$ of the local error (see, for example,
Sec. 5.5-3), where

$$T_q \approx \gamma_q h_n^{q+1} \qquad (5.7-9)$$

We now check whether this integration step has been successful by testing if

$$|T_q| \le \bar{\epsilon} h_n \qquad (5.7-10)$$

where $\bar{\epsilon} = \epsilon/H$ and $H = b - a$ is the length of the interval over which errors
of size $\epsilon$ are to be allowed (cf. Sec. 5.8-7). If (5.7-10) does not hold, we repeat
the integration with the new step size $\bar{h}_n = \eta_q h_n$, where

$$\eta_q = \left( \frac{\lambda \bar{\epsilon} h_n}{|T_q|} \right)^{1/q} \qquad (5.7-11)$$

and $\lambda$ is a safety factor, $0 < \lambda < 1$, typically taken as .8. With this choice of
$\bar{h}_n$ we have that

$$|\bar{T}_q| \approx |\gamma_q \bar{h}_n^{q+1}| = |\gamma_q h_n^{q+1} \eta_q^{q+1}| \approx \lambda \bar{\epsilon} \bar{h}_n \qquad (5.7-12)$$

so that the integration with step size $\bar{h}_n$ will be satisfactory.
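
A minimal Python sketch of the acceptance test (5.7-10) and the step-size reduction (5.7-11) follows. The function name `check_step` and its argument names are illustrative assumptions; the local error estimate `T_q` is taken as given, since how it is obtained depends on the method used.

```python
def check_step(T_q, h_n, q, eps, H, lam=0.8):
    """Apply the test |T_q| <= (eps/H) * h_n of (5.7-10).

    Returns (True, h_n) when the step is accepted, otherwise
    (False, eta_q * h_n) with eta_q taken from (5.7-11).
    """
    eps_bar = eps / H
    if abs(T_q) <= eps_bar * h_n:
        return True, h_n
    eta_q = (lam * eps_bar * h_n / abs(T_q)) ** (1.0 / q)
    return False, eta_q * h_n


# Usage with a model error T_q = gamma * h^(q+1); gamma is a made-up constant.
gamma, q, h = 2.0, 4, 0.1
accepted, h_bar = check_step(gamma * h ** (q + 1), h, q, eps=1e-6, H=1.0)
new_T = gamma * h_bar ** (q + 1)          # error expected with the reduced step
```

With the model error used here, `new_T` equals $\lambda \bar{\epsilon} \bar{h}_n$ up to rounding, which is the content of (5.7-12).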
If (5.7-10) does hold, we proceed to the next stage in the integration. At
this new stage, we must decide on a new step size $h_{n+1}$ and a new order.
Since we are changing the step size freely, we must restrict changes in order
to avoid instability. Hence, we permit a change in order from q to q - 1 or
q + 1 only if q + 1 steps have been taken with order q. If this is not the case,
we retain the previous order q and set $h_{n+1} = \eta_q h_n$, based on the assumption
that $\gamma_q$ does not vary too much as we proceed from $x_n$ to $x_{n+1}$.

If it is permissible to change the order by 1, we test whether such a
change will be beneficial in that it will allow a choice of a larger step size. To
determine this, we compute estimates of $T_{q-1}$ and $T_{q+1}$, that is, the local
truncation errors had we integrated from $x_n$ with methods of orders q - 1 or
q + 1, respectively. The information needed to compute these estimates is
either already available or can easily be generated. How this information is
computed depends on the particular details of the method used. Since we are
only interested in the basic ideas of these methods, we shall not discuss this
aspect of the computation. Having computed $T_{q-1}$ and $T_{q+1}$, we compute
$\eta_{q-1}$ and $\eta_{q+1}$ by formulas analogous to (5.7-11). We now have three
potential step sizes at our disposal, $\eta_{q-1} h_n$, $\eta_q h_n$, and $\eta_{q+1} h_n$. As before, the
estimated local error, taking $h_{n+1}$ to be any of these step sizes and using a
method of corresponding order, is approximately equal to $\lambda \bar{\epsilon} h_{n+1}$. Hence, we
choose that order which gives a maximum step size, thus defining $h_{n+1}$. In
this way, we step along the integration interval [a, b] until we reach b, using
an optimal number of steps consistent with stability considerations, which
preclude changing the order too frequently.
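
The order-selection rule above can be sketched as follows. The interface is an assumption made for illustration: `T` is a dictionary mapping each candidate order k to its estimated local error $T_k$, `steps_at_q` counts the steps already taken with the current order, and the numerical values in the usage lines are invented.

```python
def next_order_and_step(T, h_n, q, eps_bar, steps_at_q, lam=0.8):
    """Choose the order among q-1, q, q+1 that allows the largest next step."""
    def eta(k):
        # Analogue of (5.7-11) for a method of order k.
        return (lam * eps_bar * h_n / abs(T[k])) ** (1.0 / k)

    candidates = [q]
    if steps_at_q >= q + 1:               # an order change is permitted
        candidates += [k for k in (q - 1, q + 1) if k >= 1 and k in T]
    best = max(candidates, key=eta)
    return best, eta(best) * h_n


# Usage with invented error estimates for orders 3, 4, and 5.
T = {3: 4e-8, 4: 1e-8, 5: 1e-9}
order_free, h_free = next_order_and_step(T, 0.1, 4, 1e-6, steps_at_q=5)
order_held, h_held = next_order_and_step(T, 0.1, 4, 1e-6, steps_at_q=2)
```

When fewer than q + 1 steps have been taken at order q, only the current order is considered, so the second call keeps order 4 and merely rescales the step.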


5.7-2 Some Illustrative Examples 


Our main object in this section is to compare the use of two predictor- 
corrector methods, that of Milne (5.5-27) and that of Hamming (5.5-33), in 
the solution of some differential equations in order to illustrate a number of 
the points made in this chapter. 


First example  The first equation we consider is

$$\frac{dy}{dx} = -y \qquad y(0) = 1 \qquad (5.7-13)$$

whose solution is $Y = e^{-x}$. For a value of h = .1, the results of this
computation for various values of x are shown in Table 5.3. All the computation
was carried out on a digital computer using floating-point decimal arithmetic
with eight significant figures. For both predictor-corrector methods, the
fourth-order Runge-Kutta method (5.8-46) was used to calculate the values
through x = .4.† The truncation errors for the two methods are

$$T = \begin{cases} -\frac{1}{40} h^5 Y^{(5)}(\eta) = 2.5 \times 10^{-7} e^{-\eta} & \text{Hamming} \\ -\frac{1}{90} h^5 Y^{(5)}(\eta) \approx 1.1 \times 10^{-7} e^{-\eta} & \text{Milne} \end{cases} \qquad (5.7-14)$$


Since the solution of (5.7-13) is rapidly decreasing, we used a relative-error
criterion in determining when to terminate the corrector iteration. The
convergence factor used was $1.0 \times 10^{-5}$. For if $\delta_i$ is less than this
convergence factor, then with K = 1

$$h^2 b_{-1}^2 \delta_i K < \begin{cases} 1.4 \times 10^{-8} & \text{Hamming} \\ 1.1 \times 10^{-8} & \text{Milne} \end{cases} \qquad (5.7-15)$$

Since we are using relative error, these must be compared with the
coefficients of $e^{-\eta}$ in (5.7-14). For Hamming's and Milne's method, the
bounds in (5.7-15) are, respectively, about 1/18 and 1/10 times the coefficients
in (5.7-14). Therefore, we conclude that if $\delta_i$ is less than $1.0 \times 10^{-5}$, the error
incurred by not iterating the corrector to convergence will not be serious.

During the early stages of the computation until x = 4.0, the smaller 
truncation error of Milne’s method results in smaller errors using this 
method. From x = 4.0 on, however, the superiority of Hamming’s method 
becomes manifest. At x = 15.0 there is a relative error of $10^{-5}$ in Hamming's
method. Correspondingly, the relative error in Milne's method at x = 15.0 is
about -25. The reason why the early good behavior of Milne's method is
not continued is, of course, the instability of Milne's method for all positive
K that we discussed in Sec. 5.5-4. (Here $\partial f/\partial y = -1$, so that K = 1.) In the
early stages of the computation, the coefficient of $r_1^n$ in (5.4-26) is small
because of the accuracy of the initial conditions generated using (5.8-46).
During this part of the computation, the per step truncation error determines
the error, and thus Milne's method gives more accurate results than
Hamming's. But sooner or later the term in $r_1^n$ must predominate in the error
in Milne's method, thereby producing the instability which is evident later in
the computation. In fact, late in the computation the sign of y alternates
from one step to the next, e.g., note the entries for x = 13.5 and 14.5. The 
stability of Hamming’s method is best illustrated by noticing the quite slow 
growth in the relative error as the computation proceeds. This example is a 
good illustration of the necessity of using a stable method for any solution of 
a differential equation that is going to proceed over more than just a few 
steps in h. 
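
The alternating sign and the growth described above can be illustrated by computing the two roots of the stability equation of the corrector alone. The sketch below assumes that the Milne corrector is Simpson's rule, $y_{n+1} = y_{n-1} + \frac{h}{3}(y'_{n+1} + 4y'_n + y'_{n-1})$, applied to $y' = -Ky$ and iterated to convergence; the function name `milne_corrector_roots` is hypothetical.

```python
import math


def milne_corrector_roots(hK):
    """Roots of (1 + hK/3) r^2 + (4hK/3) r - (1 - hK/3) = 0.

    This quadratic results from applying Simpson's rule to y' = -K y.
    Returns (r0, r1): r0 approximates exp(-hK), r1 is the extraneous root.
    """
    a = 1.0 + hK / 3.0
    b = 4.0 * hK / 3.0
    c = -(1.0 - hK / 3.0)
    disc = math.sqrt(b * b - 4.0 * a * c)
    return (-b + disc) / (2.0 * a), (-b - disc) / (2.0 * a)


r0, r1 = milne_corrector_roots(0.1)       # first example: K = 1, h = .1
s0, s1 = milne_corrector_roots(-0.1)      # second example: df/dy = +1, h = .1
```

For hK = .1 the extraneous root `r1` is negative and larger than 1 in magnitude, so its contribution changes sign at every step and eventually swamps the decaying true solution; for hK = -.1 its magnitude is below 1 and it dies out.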

For each of the 146 steps of the computation using (5.5-33), the test of
the convergence factor was satisfied after one application of the corrector, so
that just two evaluations of f(x, y) were needed at each step. The instability
of Milne's method, however, necessitated an average of three corrector
iterations per step, one near the beginning but five toward the end. Finally, we
note that in this example the truncation error is substantially greater than
roundoff; for a contrast, see the third example below.

† Actually, only values through x = .3 are required to start Milne's or Hamming's methods.


Second example  Consider the equation

$$\frac{dy}{dx} = y \qquad y(0) = 1 \qquad (5.7-16)$$

whose solution is $Y = e^x$. In Table 5.4, we have the results of this
computation tabulated. Again an interval of h = .1 was used, the first four steps
were calculated using (5.8-46), and floating-point arithmetic was used throughout
with a convergence factor of $1.0 \times 10^{-5}$ and a relative-error criterion. This
time the superior per step error of Milne's method causes the solution with


Table 5.4 Results of numerical solution of y' = y, y(0) = 1†

                     Hamming's method       Milne's method         Hermite method‡
    x       e^x          y        Error         y        Error         y        Error
   .1   1.1051709   1.1051708    1.0E-7    1.1051708    1.0E-7    1.1051708    1.0E-7
   .2   1.2214028   1.2214024    4.0E-7    1.2214024    4.0E-7    1.2214024    4.0E-7
   .3   1.3498588   1.3498582    6.0E-7    1.3498582    6.0E-7    1.3498582    6.0E-7
   .4   1.4918247   1.4918238    9.0E-7    1.4918238    9.0E-7    1.4918238    9.0E-7
   .5   1.6487213   1.6487205    8.0E-7    1.6487206    7.0E-7    1.6487203    1.0E-6
   .6   1.8221188   1.8221184    4.0E-7    1.8221179    9.0E-7    1.8221178    1.0E-6
   .7   2.0137527   2.0137529   -2.0E-7    2.0137520    7.0E-7    2.0137516    1.1E-6
   .8   2.2255409   2.2255418   -9.0E-7    2.2255400    9.0E-7    2.2255397    1.2E-6
   .9   2.4596031   2.4596048   -1.7E-6    2.4596024    7.0E-7    2.4596017    1.4E-6
  1.0   2.7182818   2.7182845   -2.7E-6    2.7182810    8.0E-7    2.7182803    1.5E-6
  2.0   7.3890561   7.3890860   -3.0E-5    7.3890573   -1.2E-6    7.3890511    5.0E-6
  3.0   20.085537   20.085675   -1.4E-4    20.085548   -1.1E-5    20.085520    1.7E-5
  4.0   54.598150   54.598685   -5.4E-4    54.598204   -5.4E-5    54.598097    5.3E-5
  5.0   148.41316   148.41509   -1.9E-3    148.41337   -2.1E-4    148.41301    1.5E-4
  6.0   403.42879   403.43529   -6.5E-3    403.42952   -7.3E-4    403.42837    4.2E-4
  7.0   1096.6332   1096.6542   -2.1E-2    1096.6357   -2.5E-3    1096.6319    1.3E-3
  8.0   2980.9580   2981.0243   -6.6E-2    2980.9661   -8.1E-3    2980.9541    3.9E-3
  9.0   8103.0839   8103.2885   -.20       8103.1094   -2.6E-2    8103.0722    1.2E-2
 10.0   22026.466   22027.089   -.62       22026.544   -7.8E-2    22026.432    3.4E-2
 11.0   59874.142   59876.025   -1.9       59874.381   -.24       59874.046    .096
 12.0   162754.79   162760.39   -5.6       162755.51   -.72       162754.51    .28
 13.0   442413.39   442429.93   -16.5      442415.51   -2.1       442412.57    .82
 14.0   1202604.3   1202652.9   -48.6      1202610.6   -6.3       1202602.0    2.3
 15.0   3269017.4   3269159.6   -142.2     3269035.8   -18.4      3269011.1    6.3

† The notation E-7, etc., is shorthand for $\times 10^{-7}$, etc.
‡ The Hermite method is discussed in Sec. 5.9-1.
that method to be more accurate throughout the computation. For with
$\partial f/\partial y = 1$ the magnitude of $r_1$ is less than that of $r_0$ (see Example 5.5). The
smaller error in Hamming's method for x = .5 to x = .9 is a result of the
errors in the values calculated using (5.8-46). These errors are positive, as we
see from Table 5.4, but the truncation errors of both Milne's and Hamming's
method are negative (why?), so that the larger negative error in Hamming's
method overcomes the positive error in the initial values more rapidly than
the error in Milne's method. For both methods only one application of the
corrector was needed at each step.

The growth of the error as x increases makes this example a good one 
with which to illustrate change of interval. The quantity $y_n - y_n^{(0)}$ grows from
$3.9 \times 10^{-6}$ at x = .5 to $7.9 \times 10^{-2}$ at x = 10.0 using Hamming's method
and from $4.0 \times 10^{-6}$ to $5.9 \times 10^{-2}$ using Milne's method. For illustrative
purposes in Table 5.5, we give the results of changing the interval to h = .05 
at x = 10.0. The values of y at x = 10.05, 10.1, 10.15, 10.2 were calculated 
using (5.8-46). The percentage improvement in the error is substantially 
greater for Milne’s than for Hamming’s method. The reason for this is that 
the propagated-error terms in the solution are not negligible at x = 10.0 in 
Hamming’s method, and the reduction in h (and therefore in the truncation 
error) does not prevent the continuation of this error propagation. In 
Milne’s method the error-propagation terms are very small when x = 10.0, 
and thus the reduction in truncation error has more effect on the overall 
error. 

This second example illustrates the general rule, that between two stable
methods, the one with the smaller truncation error should usually be chosen.
It is well to emphasize at this point that a priori, in most numerical solutions
of differential equations, we do not know much about the behavior of $\partial f/\partial y$
and, moreover, that in most cases $\partial f/\partial y$ will take on both positive and
negative values during the computation (with systems the behavior will be
even more complex). Thus we must generally choose a method such as
Hamming's, which is stable for both positive and negative $\partial f/\partial y$, rather than
Milne's.


Table 5.5 Change of interval in the solution of y' = y, y(0) = 1

                     Hamming's method       Milne's method         Hermite method
    x       e^x          y        Error         y        Error         y        Error
 11.0   59874.142   59875.835   -1.7       59874.339   -.20       59874.043    .099
 12.0   162754.79   162759.51   -4.7       162755.29   -.50       162754.53    .26
 13.0   442413.39   442426.39   -13.0      442414.63   -1.2       442412.69    .70
 14.0   1202604.3   1202640.2   -35.9      1202607.5   -3.2       1202602.3    2.0
 15.0   3269017.4   3269117.1   -99.7      3269025.2   -7.8       3269012.1    5.3
We did not explicitly use the error-estimation ability of Sec. 5.5-3 in 
either of these examples. In fact, the change of interval discussed above could 
have been done automatically using error estimation {39}. 


Third example  Consider the equation

$$\frac{dy}{dx} = \frac{1}{1 + \tan^2 y} \qquad y(0) = 0$$

whose solution is $Y = \tan^{-1} x$. Again with h = .1, using (5.8-46) for the first
four values and using floating-point arithmetic, some values for the
computation are shown in Table 5.6. Since the magnitude of the solution does not
change greatly, we used an absolute-error criterion with a convergence
factor of $5.0 \times 10^{-5}$. In both cases, only one application of the corrector per
step was needed. As in the second example, $\partial f/\partial y$ is always positive, so that
again Milne's method does not lead to instability. Thus, again the smaller
truncation error in Milne's method causes that method to give higher
accuracy until late in the computation, when the errors in the two computations
become almost equal in magnitude although opposite in sign. This is
because, early in the computation, the roundoff error becomes a significant
part of the total error and late in the computation the roundoff dominates.
This occurs because $\tan^2 y$ gets very large as y approaches $\pi/2$, thus making
the truncation error almost zero, so that the roundoff error, even though
occurring in the eighth figure, is greater than the truncation error. The
roundoff properties of Milne’s method are slightly better than those of 
Hamming’s {40}, but because of the statistical nature of the roundoff, this is 
not enough to show up significantly in the results tabulated in Table 5.6. 
Proper use of the error-estimation ability of predictor-corrector methods 
would have led us to increase the value of h when the roundoff error became 
dominant. In fact, h should always be increased in the numerical solution of 
differential equations when roundoff becomes dominant if values of the 
solution at the smaller spacing are not required (why?). 


5.8 RUNGE-KUTTA METHODS 


These methods can be used to generate not only starting values but, in fact, 
the whole solution. They are self-starting and easy to program for a digital 
computer, but these advantages do not overcome their disadvantages in 
error-estimation ability and speed relative to predictor-corrector methods. 
However, their value in starting the solution and in changing the interval is 
great. Furthermore, because of their simplicity, they are competitive in 
terms of speed with predictor-corrector methods whenever the derivative 
function f(x, y) is simple. In this case, the time spent on “housekeeping” 
operations may be greater than that spent on pure computation, so that a 
simple algorithm performs better than a more complicated one even if more 
function evaluations are required. In addition, as we shall see, we can 
achieve an error estimate at the cost of several additional function evalua- 
tions. Hence, the second advantage of predictor-corrector methods also 
vanishes when function-evaluation time is not critical. This situation is 
particularly common with large systems of differential equations. 

The basis of all Runge-Kutta methods is to express the difference
between the values of y at $x_{n+1}$ and $x_n$ as

$$y_{n+1} - y_n = \sum_{i=1}^{p} w_i k_i \qquad (5.8-1)$$

where the $w_i$'s are constants and

$$k_i = h_n f\left( x_n + \alpha_i h_n,\; y_n + \sum_{j=1}^{i-1} \beta_{ij} k_j \right) \qquad (5.8-2)$$

with $h_n = x_{n+1} - x_n$ and $\alpha_1 = 0$. We use $h_n$ instead of h since it is possible to
vary the interval at each stage of the use of a Runge-Kutta method. Clearly,
given the $w_i$'s, $\alpha_i$'s, and $\beta_{ij}$'s, (5.8-2) is a single-step method for the solution of
(5.1-1). In this section we shall again consider a single equation, but the
extension to N equations is straightforward {50}.
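
A direct transcription of (5.8-1) and (5.8-2) into Python may help fix the notation. It is a sketch under assumed conventions: the function name `rk_step` is hypothetical, indices are 0-based, and `beta[i][j]` holds $\beta_{ij}$ for j < i. The usage lines take the simplest case p = 1 with $w_1 = 1$.

```python
def rk_step(f, x_n, y_n, h_n, w, alpha, beta):
    """One step of y_{n+1} = y_n + sum_i w_i k_i with k_i as in (5.8-2)."""
    k = []
    for i in range(len(w)):
        # Each k_i uses only the previously computed k_j, j < i.
        y_arg = y_n + sum(beta[i][j] * k[j] for j in range(i))
        k.append(h_n * f(x_n + alpha[i] * h_n, y_arg))
    return y_n + sum(w_i * k_i for w_i, k_i in zip(w, k))


# Usage: p = 1, w_1 = 1, alpha_1 = 0 applied to y' = -y, y(0) = 1, h = .1.
f = lambda x, y: -y
y_one_stage = rk_step(f, 0.0, 1.0, 0.1, w=[1.0], alpha=[0.0], beta=[[]])
# A two-stage choice: w = (0, 1), alpha_2 = beta_21 = 1/2.
y_two_stage = rk_step(f, 0.0, 1.0, 0.1, w=[0.0, 1.0],
                      alpha=[0.0, 0.5], beta=[[], [0.5]])
```

Because every $k_i$ depends only on earlier stages, the loop needs no iteration or equation solving, which is what makes these methods self-starting and easy to program.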

Our object is to determine the $w_i$'s, $\alpha_i$'s, and $\beta_{ij}$'s so that (5.8-1) has the
properties we desire. In particular, our object is to make the coefficients of
$h_n^t$ in the Taylor-series expansion of both sides of (5.8-1) about $(x_n, y_n)$
identical for t = 1, 2, ..., m. As we shall see, we can do no better than m = p. The
resulting formula will be called a Runge-Kutta method of order m. For
convenience throughout this section we shall not distinguish notationally
between the true and calculated solutions. In particular, we shall differentiate y
as if it were the true solution.

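The expansion of the left-hand side of (5.8-1) is

$$y_{n+1} - y_n = \sum_{t=1}^{\infty} h_n^t y_n^{(t)}/t! \qquad (5.8-3)$$

From (5.1-1)

$$y_n^{(t)} = \frac{d^{t-1}}{dx^{t-1}} f(x_n, y_n) = \left( \frac{\partial}{\partial x} + \frac{dy}{dx} \frac{\partial}{\partial y} \right)^{t-1} f(x_n, y_n) = \left( \frac{\partial}{\partial x} + f \frac{\partial}{\partial y} \right)^{t-1} f(x_n, y_n) \qquad (5.8-4)$$

where f = f(x, y). Using this, (5.8-3) becomes

$$y_{n+1} - y_n = \sum_{t=0}^{\infty} \frac{h_n^{t+1}}{(t+1)!} \left( \frac{\partial}{\partial x} + f \frac{\partial}{\partial y} \right)^{t} f(x_n, y_n) \qquad (5.8-5)$$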
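We define

$$D = \frac{\partial}{\partial x} + f_n \frac{\partial}{\partial y} \qquad f_n = f(x_n, y_n) \qquad (5.8-6)$$

and so can write, for example,

$$\left( \frac{\partial}{\partial x} + f \frac{\partial}{\partial y} \right)^2 f(x_n, y_n) = (D^2 f + f_y Df) \big|_n \qquad (5.8-7)$$

where the notation $|_n$ means that all quantities should be evaluated at
$(x_n, y_n)$ (and h should be replaced by $h_n$).

When (5.8-6) is used, the first few terms of (5.8-5) are

$$\begin{aligned}
y_{n+1} - y_n = \Big[ & hf + \frac{h^2}{2} Df + \frac{h^3}{6} (D^2 f + f_y Df) \\
& + \frac{h^4}{24} (D^3 f + f_y D^2 f + f_y^2 Df + 3 Df\, Df_y) \\
& + \frac{h^5}{120} \big( D^4 f + 6 Df\, D^2 f_y + 4 D^2 f\, Df_y + f_y^3 Df + f_y^2 D^2 f \\
& \qquad + 3 (Df)^2 f_{yy} + f_y D^3 f + 7 f_y Df\, Df_y \big) \Big]_n + O(h_n^6)
\end{aligned} \qquad (5.8-8)$$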


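To get the expansion of the right-hand side of (5.8-1), we first use the
Taylor-series expansion for two variables to write

$$f\left[ x_n + \alpha_i h_n,\; y_n + \Big( \sum_{j=1}^{i-1} \beta_{ij} \Big) h_n f_n \right] = \sum_{t=0}^{\infty} h_n^t D_i^t f(x_n, y_n)/t! \qquad (5.8-9)$$

where

$$D_i = \alpha_i \frac{\partial}{\partial x} + \Big( \sum_{j=1}^{i-1} \beta_{ij} \Big) f_n \frac{\partial}{\partial y} \qquad (5.8-10)$$

Using (5.8-9) and (5.8-10), we can get expansions for each $k_i$ on the
right-hand side of (5.8-1).
Since $\alpha_1 = 0$,

$$k_1 = h_n f_n \qquad (5.8-11)$$

For $k_2$ we have

$$k_2 = h_n f(x_n + \alpha_2 h_n,\; y_n + \beta_{21} k_1) = h_n f(x_n + \alpha_2 h_n,\; y_n + \beta_{21} h_n f_n) = h_n \sum_{t=0}^{\infty} h_n^t D_2^t f(x_n, y_n)/t! \qquad (5.8-12)$$

where $D_2$ is given by (5.8-10). For $k_3$ we proceed as follows, using (5.8-11)
and (5.8-12):

$$\begin{aligned}
k_3 &= h_n f(x_n + \alpha_3 h_n,\; y_n + \beta_{31} k_1 + \beta_{32} k_2) \\
&= h_n f[x_n + \alpha_3 h_n,\; y_n + (\beta_{31} + \beta_{32}) h_n f_n + \beta_{32} (k_2 - h_n f_n)] \\
&= h_n \sum_{t=0}^{\infty} \left[ h_n D_3 + \beta_{32} (k_2 - h_n f_n) \frac{\partial}{\partial y} \right]^t f(x_n, y_n)/t!
\end{aligned} \qquad (5.8-13)$$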
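By using (5.8-12) in (5.8-13) we can consider (5.8-13) to be an expansion in
powers of $h_n$. The procedure for $k_3$ suggests the general procedure for $k_i$. We
write

$$\begin{aligned}
k_i &= h_n f\Big( x_n + \alpha_i h_n,\; y_n + \sum_{j=1}^{i-1} \beta_{ij} k_j \Big) \\
&= h_n f\Big( x_n + \alpha_i h_n,\; y_n + \sum_{j=1}^{i-1} [\beta_{ij} h_n f_n + \beta_{ij} (k_j - h_n f_n)] \Big) \\
&= h_n \sum_{t=0}^{\infty} \Big[ h_n D_i + \sum_{j=1}^{i-1} \beta_{ij} (k_j - h_n f_n) \frac{\partial}{\partial y} \Big]^t f(x_n, y_n)/t!
\end{aligned} \qquad (5.8-14)$$

and then use the results for $k_j$, j < i to write $k_i$ as an expansion in powers of
$h_n$. Leaving the algebra to a problem {42}, we give the results for i = 1, 2, 3, 4,
retaining terms through $h^5$ (all quantities being evaluated at $(x_n, y_n)$):

$$k_1 = hf \qquad (5.8-15)$$

$$k_2 = hf + h^2 D_2 f + \frac{h^3}{2} D_2^2 f + \frac{h^4}{6} D_2^3 f + \frac{h^5}{24} D_2^4 f + O(h^6) \qquad (5.8-16)$$

$$\begin{aligned}
k_3 = {} & hf + h^2 D_3 f + h^3 \left( \tfrac{1}{2} D_3^2 f + \beta_{32} f_y D_2 f \right) \\
& + h^4 \left( \tfrac{1}{6} D_3^3 f + \tfrac{1}{2} \beta_{32} f_y D_2^2 f + \beta_{32} D_2 f\, D_3 f_y \right) \\
& + h^5 \left( \tfrac{1}{24} D_3^4 f + \tfrac{1}{6} \beta_{32} f_y D_2^3 f + \tfrac{1}{2} \beta_{32} D_2^2 f\, D_3 f_y + \tfrac{1}{2} \beta_{32}^2 f_{yy} (D_2 f)^2 + \tfrac{1}{2} \beta_{32} D_2 f\, D_3^2 f_y \right) \\
& + O(h^6)
\end{aligned} \qquad (5.8-17)$$

$$\begin{aligned}
k_4 = {} & hf + h^2 D_4 f + h^3 \left( \tfrac{1}{2} D_4^2 f + \beta_{42} f_y D_2 f + \beta_{43} f_y D_3 f \right) \\
& + h^4 \big[ \tfrac{1}{6} D_4^3 f + \tfrac{1}{2} \beta_{42} f_y D_2^2 f + \beta_{32} \beta_{43} (f_y)^2 D_2 f + \tfrac{1}{2} \beta_{43} f_y D_3^2 f \\
& \qquad + \beta_{42} D_2 f\, D_4 f_y + \beta_{43} D_3 f\, D_4 f_y \big] \\
& + h^5 \big[ \tfrac{1}{24} D_4^4 f + \tfrac{1}{6} \beta_{42} f_y D_2^3 f + \tfrac{1}{2} \beta_{32} \beta_{43} (f_y)^2 D_2^2 f + \beta_{32} \beta_{43} f_y D_2 f\, D_3 f_y \\
& \qquad + \tfrac{1}{6} \beta_{43} f_y D_3^3 f + \tfrac{1}{2} \beta_{42} D_4 f_y\, D_2^2 f + \tfrac{1}{2} \beta_{43} D_4 f_y\, D_3^2 f + \tfrac{1}{2} \beta_{42}^2 f_{yy} (D_2 f)^2 \\
& \qquad + \beta_{42} \beta_{43} f_{yy} D_2 f\, D_3 f + \tfrac{1}{2} \beta_{43}^2 f_{yy} (D_3 f)^2 + \tfrac{1}{2} \beta_{42} D_2 f\, D_4^2 f_y \\
& \qquad + \tfrac{1}{2} \beta_{43} D_3 f\, D_4^2 f_y + \beta_{43} \beta_{32} f_y D_2 f\, D_4 f_y \big] + O(h^6)
\end{aligned} \qquad (5.8-18)$$

Equations (5.8-15) to (5.8-18) will enable us to develop all Runge-Kutta
methods through order 4, and the terms in $h^5$ will facilitate discussion of the
error term.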
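Substituting (5.8-15) to (5.8-18) and (5.8-8) into (5.8-1) and matching
powers of $h_n$ through $h_n^4$, we get

$$h_n: \quad w_1 + w_2 + w_3 + w_4 = 1 \qquad (5.8-19)$$

$$h_n^2: \quad w_2 D_2 f + w_3 D_3 f + w_4 D_4 f = \tfrac{1}{2} Df \qquad (5.8-20)$$

$$h_n^3: \quad \tfrac{1}{2} [w_2 D_2^2 f + w_3 D_3^2 f + w_4 D_4^2 f] + f_y [w_3 \beta_{32} D_2 f + w_4 (\beta_{42} D_2 f + \beta_{43} D_3 f)] = \tfrac{1}{6} (D^2 f + f_y Df) \qquad (5.8-21)$$

$$\begin{aligned}
h_n^4: \quad & \tfrac{1}{6} [w_2 D_2^3 f + w_3 D_3^3 f + w_4 D_4^3 f] + \tfrac{1}{2} f_y [w_3 \beta_{32} D_2^2 f + w_4 (\beta_{42} D_2^2 f + \beta_{43} D_3^2 f)] \\
& + [w_3 \beta_{32} D_2 f\, D_3 f_y + w_4 (\beta_{42} D_2 f\, D_4 f_y + \beta_{43} D_3 f\, D_4 f_y)] + f_y^2 [w_4 \beta_{32} \beta_{43} D_2 f] \\
& = \tfrac{1}{24} [D^3 f + f_y D^2 f + 3 Df\, Df_y + f_y^2 Df]
\end{aligned} \qquad (5.8-22)$$

These are in reality not four but eight equations, since if the values of the
$w_i$'s, $\alpha_i$'s, and $\beta_{ij}$'s are to be independent of f(x, y), as they must be to be
useful, the expressions in square brackets on the left-hand sides of (5.8-21)
and (5.8-22), which are homogeneous in the operators, must equal the
corresponding terms on the right-hand sides {42}. Moreover, if the resulting eight
equations are to be independent of f(x, y), the ratios

$$\frac{D_j f}{Df} \quad j = 2, 3, 4 \qquad \text{and} \qquad \frac{D_j f_y}{Df_y} \quad j = 3, 4 \qquad (5.8-23)$$

must be constant. This will be true if

$$\alpha_i = \sum_{j=1}^{i-1} \beta_{ij} \qquad i = 2, 3, 4 \qquad (5.8-24)$$

for then

$$D_i = \alpha_i D \qquad (5.8-25)$$

Finally, then, the eight equations are

$$\begin{aligned}
w_1 + w_2 + w_3 + w_4 &= 1 \\
w_2 \alpha_2 + w_3 \alpha_3 + w_4 \alpha_4 &= \tfrac{1}{2} \\
w_2 \alpha_2^2 + w_3 \alpha_3^2 + w_4 \alpha_4^2 &= \tfrac{1}{3} \\
w_3 \alpha_2 \beta_{32} + w_4 (\alpha_2 \beta_{42} + \alpha_3 \beta_{43}) &= \tfrac{1}{6} \\
w_2 \alpha_2^3 + w_3 \alpha_3^3 + w_4 \alpha_4^3 &= \tfrac{1}{4} \\
w_3 \alpha_2^2 \beta_{32} + w_4 (\alpha_2^2 \beta_{42} + \alpha_3^2 \beta_{43}) &= \tfrac{1}{12} \\
w_3 \alpha_2 \alpha_3 \beta_{32} + w_4 (\alpha_2 \beta_{42} + \alpha_3 \beta_{43}) \alpha_4 &= \tfrac{1}{8} \\
w_4 \alpha_2 \beta_{32} \beta_{43} &= \tfrac{1}{24}
\end{aligned} \qquad (5.8-26)$$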
where the first equation corresponds to (5.8-19), the second to (5.8-20), the
next two to (5.8-21), and the last four to (5.8-22). The system (5.8-24) and
(5.8-26) has 11 equations and 13 unknowns, which will generally be sufficient
to determine the parameters with two degrees of freedom. We note that
because of the last equation, it is necessary to include the $k_4$ term in order to
achieve accuracy through $h^4$ (why?), thus verifying that p must be at least m.
For the cases m = 2 and 3, we can also show that $p \ge m$. The general result
that $p \ge m$ is not hard to prove {43}. Since a treatment of the cases m > 4 is
quite involved, we shall be considering in detail only the cases m = 2, 3, 4.
First, however, it will be convenient to consider the errors in Runge-Kutta
methods.
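
The eight conditions can be checked mechanically for any proposed set of coefficients. The sketch below evaluates the left-hand sides of (5.8-26) in exact rational arithmetic and compares them with 1, 1/2, 1/3, 1/6, 1/4, 1/12, 1/8, 1/24. The function name `order_conditions` is hypothetical, and the coefficients tested are those of the method given later as (5.8-45), with $\beta_{42} = 0$.

```python
from fractions import Fraction as F


def order_conditions(w, a, b32, b42, b43):
    """Left-hand sides of the eight equations (5.8-26).

    w = (w1, w2, w3, w4) and a = (alpha2, alpha3, alpha4).
    """
    w1, w2, w3, w4 = w
    a2, a3, a4 = a
    return [
        w1 + w2 + w3 + w4,
        w2 * a2 + w3 * a3 + w4 * a4,
        w2 * a2**2 + w3 * a3**2 + w4 * a4**2,
        w3 * a2 * b32 + w4 * (a2 * b42 + a3 * b43),
        w2 * a2**3 + w3 * a3**3 + w4 * a4**3,
        w3 * a2**2 * b32 + w4 * (a2**2 * b42 + a3**2 * b43),
        w3 * a2 * a3 * b32 + w4 * (a2 * b42 + a3 * b43) * a4,
        w4 * a2 * b32 * b43,
    ]


RHS = [F(1), F(1, 2), F(1, 3), F(1, 6), F(1, 4), F(1, 12), F(1, 8), F(1, 24)]

classical = order_conditions(
    w=(F(1, 6), F(1, 3), F(1, 3), F(1, 6)),
    a=(F(1, 2), F(1, 2), F(1)),
    b32=F(1, 2), b42=F(0), b43=F(1),
)
satisfied = classical == RHS
```

Exact fractions are used so that a satisfied condition shows up as an equality rather than as a small floating-point residual.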


5.8-1 Errors in Runge-Kutta Methods 


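Here alone in our discussion of Runge-Kutta methods is it important to
consider a single differential equation. For systems of equations, the algebra
becomes intractable.


Truncation error  Equation (5.8-1) is to be exact for powers of $h_n$ through $h_n^m$.
Therefore, the truncation error $T_m$ can be written

$$T_m = \gamma_m h_n^{m+1} + O(h_n^{m+2}) \qquad (5.8-27)$$

where, of course, both $\gamma_m$ and $T_m$ really depend on f(x, y). To estimate $T_m$, we
are forced to consider only $\gamma_m$ because consideration of the higher-order
terms is algebraically intractable. The bounds on $\gamma_m$ that we shall obtain will
be very conservative; i.e., the true magnitude of $\gamma_m$ will generally be much
less than the bound. Thus, if the $O(h_n^{m+2})$ term is small compared with
$\gamma_m h_n^{m+1}$, as we expect it will be if $h_n$ is small, then the bound on $\gamma_m h_n^{m+1}$ will
usually be a bound on the error as a whole.
Using (5.8-8), (5.8-15) to (5.8-18), and (5.8-25), we calculate

$$\gamma_2 = \left( \frac{1}{6} - \frac{\alpha_2^2 w_2}{2} \right) D^2 f + \frac{1}{6} f_y Df \qquad (5.8-28)$$

$$\begin{aligned}
\gamma_3 = {} & \left[ \frac{1}{24} - \frac{1}{6} (\alpha_2^3 w_2 + \alpha_3^3 w_3) \right] D^3 f + \left( \frac{1}{24} - \frac{1}{2} \alpha_2^2 \beta_{32} w_3 \right) f_y D^2 f \\
& + \left( \frac{1}{8} - \alpha_2 \alpha_3 \beta_{32} w_3 \right) Df\, Df_y + \frac{1}{24} f_y^2 Df
\end{aligned} \qquad (5.8-29)$$

$$\begin{aligned}
\gamma_4 = {} & \left[ \frac{1}{120} - \frac{w_2 \alpha_2^4 + w_3 \alpha_3^4 + w_4 \alpha_4^4}{24} \right] D^4 f \\
& + \left[ \frac{1}{20} - \frac{w_3 \alpha_2 \alpha_3^2 \beta_{32} + w_4 \alpha_4^2 (\alpha_2 \beta_{42} + \alpha_3 \beta_{43})}{2} \right] Df\, D^2 f_y \\
& + \left[ \frac{1}{30} - \frac{w_3 \beta_{32} \alpha_2^2 \alpha_3 + w_4 \alpha_4 (\beta_{42} \alpha_2^2 + \beta_{43} \alpha_3^2)}{2} \right] Df_y\, D^2 f \\
& + \left[ \frac{1}{120} - \frac{w_4 \beta_{43} \beta_{32} \alpha_2^2}{2} \right] f_y^2 D^2 f \\
& + \left[ \frac{1}{40} - \frac{w_3 \beta_{32}^2 \alpha_2^2 + w_4 (\beta_{43} \alpha_3 + \beta_{42} \alpha_2)^2}{2} \right] f_{yy} (Df)^2 \\
& + \left[ \frac{1}{120} - \frac{w_3 \beta_{32} \alpha_2^3 + w_4 (\beta_{43} \alpha_3^3 + \beta_{42} \alpha_2^3)}{6} \right] f_y D^3 f \\
& + \left[ \frac{7}{120} - w_4 \beta_{43} \beta_{32} \alpha_2 (\alpha_3 + \alpha_4) \right] f_y Df_y\, Df + \frac{1}{120} f_y^3 Df
\end{aligned} \qquad (5.8-30)$$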


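In order to bound $\gamma_m$, we assume the following bounds for f(x, y) and its
derivatives in a region R about $(x_n, y_n)$ containing all points in (5.8-2):

$$|f(x, y)| < M \qquad \left| \frac{\partial^{i+j} f}{\partial x^i\, \partial y^j} \right| < \frac{L^{i+j}}{M^{j-1}} \qquad i + j \le m \qquad (5.8-31)$$

the latter being chosen because it leads to the convenient forms below. Using
these bounds and (5.8-6), we can, for example, bound $D^2 f$ as

$$|D^2 f| = \left| \frac{\partial^2 f}{\partial x^2} + 2f \frac{\partial^2 f}{\partial x\, \partial y} + f^2 \frac{\partial^2 f}{\partial y^2} \right| < 4ML^2$$

Bounding the other derivative terms in (5.8-28) to (5.8-30) similarly and
using $\alpha_4 = 1$, we get {44}

$$|\gamma_2| < \left( 4 \left| \frac{1}{6} - \frac{\alpha_2^2 w_2}{2} \right| + \frac{1}{3} \right) ML^2 \qquad (5.8-32)$$

$$|\gamma_3| < \left[ 8 \left| \tfrac{1}{24} - \tfrac{1}{6} (\alpha_2^3 w_2 + \alpha_3^3 w_3) \right| + 4 \left| \tfrac{1}{24} - \tfrac{1}{2} \alpha_2^2 \beta_{32} w_3 \right| + 4 \left| \tfrac{1}{8} - \alpha_2 \alpha_3 \beta_{32} w_3 \right| + \tfrac{1}{12} \right] ML^3 \qquad (5.8-33)$$

$$\begin{aligned}
|\gamma_4| < \big( & 16 |b_1| + 4 |b_2| + |b_2 + 3b_3| + |2b_2 + 3b_3| + |b_2 + b_3| + |b_3| + 8 |b_4| \\
& + |b_5| + |2b_5 + b_7| + |b_5 + b_6 + b_7| + |b_6| + |2b_6 + b_7| + |b_7| + 2 |b_8| \big) ML^4
\end{aligned} \qquad (5.8-34)$$

where

$$\begin{aligned}
b_1 &= \tfrac{1}{120} - \tfrac{1}{24} (\alpha_2^4 w_2 + \alpha_3^4 w_3 + w_4) \\
b_2 &= \tfrac{1}{20} - \tfrac{1}{2} [\alpha_2 \alpha_3^2 \beta_{32} w_3 + (\alpha_2 \beta_{42} + \alpha_3 \beta_{43}) w_4] \\
b_3 &= \tfrac{1}{120} - \tfrac{1}{6} [\alpha_2^3 \beta_{32} w_3 + (\alpha_2^3 \beta_{42} + \alpha_3^3 \beta_{43}) w_4] \\
b_4 &= \tfrac{1}{30} - \tfrac{1}{2} [\alpha_2^2 \alpha_3 \beta_{32} w_3 + (\alpha_2^2 \beta_{42} + \alpha_3^2 \beta_{43}) w_4] \\
b_5 &= \tfrac{1}{120} - \tfrac{1}{2} \alpha_2^2 \beta_{32} \beta_{43} w_4 \\
b_6 &= \tfrac{1}{40} - \tfrac{1}{2} [\alpha_2^2 \beta_{32}^2 w_3 + (\alpha_2 \beta_{42} + \alpha_3 \beta_{43})^2 w_4] \\
b_7 &= \tfrac{7}{120} - \alpha_2 (1 + \alpha_3) \beta_{32} \beta_{43} w_4 \\
b_8 &= \tfrac{1}{120}
\end{aligned} \qquad (5.8-35)$$

We have noted that, for m = 4, the system (5.8-26) is underdetermined.
As we shall see, the corresponding systems for m = 2 and 3 are also
underdetermined. In all three cases the extra parameters can be used to minimize
the above bound on $\gamma_m$, as we shall indicate in Secs. 5.8-2 to 5.8-4.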
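
As an illustration of how the fourth-order bound is evaluated, the following sketch computes the coefficient of $ML^4$ from the $b_i$ in exact rational arithmetic. It assumes $\alpha_4 = 1$ and takes $b_8 = 1/120$ from the last term of the expansion of $\gamma_4$; the function name `gamma4_bound_coefficient` is hypothetical, and the coefficients used in the usage lines are those of the method with $\alpha_2 = \alpha_3 = 1/2$ treated in Sec. 5.8-4.

```python
from fractions import Fraction as F


def gamma4_bound_coefficient(w2, w3, w4, a2, a3, b32, b42, b43):
    """Coefficient of M L^4 in the bound on |gamma_4| (alpha_4 = 1 assumed)."""
    b1 = F(1, 120) - F(1, 24) * (a2**4 * w2 + a3**4 * w3 + w4)
    b2 = F(1, 20) - F(1, 2) * (a2 * a3**2 * b32 * w3 + (a2 * b42 + a3 * b43) * w4)
    b3 = F(1, 120) - F(1, 6) * (a2**3 * b32 * w3 + (a2**3 * b42 + a3**3 * b43) * w4)
    b4 = F(1, 30) - F(1, 2) * (a2**2 * a3 * b32 * w3 + (a2**2 * b42 + a3**2 * b43) * w4)
    b5 = F(1, 120) - F(1, 2) * a2**2 * b32 * b43 * w4
    b6 = F(1, 40) - F(1, 2) * (a2**2 * b32**2 * w3 + (a2 * b42 + a3 * b43)**2 * w4)
    b7 = F(7, 120) - a2 * (1 + a3) * b32 * b43 * w4
    b8 = F(1, 120)
    return (16 * abs(b1) + 4 * abs(b2) + abs(b2 + 3 * b3) + abs(2 * b2 + 3 * b3)
            + abs(b2 + b3) + abs(b3) + 8 * abs(b4) + abs(b5) + abs(2 * b5 + b7)
            + abs(b5 + b6 + b7) + abs(b6) + abs(2 * b6 + b7) + abs(b7)
            + 2 * abs(b8))


# Usage: alpha_2 = alpha_3 = 1/2, w = (1/6, 1/3, 1/3, 1/6),
# beta_32 = 1/2, beta_42 = 0, beta_43 = 1.
coeff = gamma4_bound_coefficient(F(1, 3), F(1, 3), F(1, 6), F(1, 2), F(1, 2),
                                 F(1, 2), F(0), F(1))
```

For these coefficients the result is 73/720, so the bound reads $|\gamma_4| < \frac{73}{720} ML^4$.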


Propagated-error bounds If Runge-Kutta methods are used for the complete 
solution of (5.1-1), bounds on the propagated error will also be important. 
We shall content ourselves here with stating without proof [see Galler and 
Rozenberg (1960)] the following theorem on propagated error bounds, 
which covers many of the specific Runge-Kutta methods we shall consider 
[but cf. (5.8-46)]. 


Theorem 5.3  If $w_i \ge 0$, $\alpha_i \ge 0$, $\beta_{ij} \ge 0$ ($i = 1, \ldots, 4$; $j = 1, \ldots, 3$; $ij \ne 42$),
$\beta_{42} \le 0$, if $\partial f/\partial y$ is continuous, negative, and bounded from above and
below in a region D in the xy plane:

$$-M_2 < \frac{\partial f}{\partial y} < -M_1 < 0$$

if the maximum error (truncation and roundoff) committed in any step
is less than E in magnitude, and if the solution remains in a region D*,
approaching no closer to the boundary of D than $Qh + |\epsilon_i|$, where
$Q = \max_{(x, y) \in D} |f(x, y)|$, then the total error at the $i$th step $\epsilon_i$ satisfies
the inequality


where h is constant for all steps and must be such that 
h< min( Mi 
M3” M3 — 2Wg Bar M3 — 2Wa Bar %. MM, 


This indicates the difficulty of obtaining definitive results about Runge- 
Kutta methods. The bound can be expected to be very conservative. 
Roundoff error  We could choose the free parameters to minimize the
roundoff in (5.8-1), i.e., to make the $w_i$ as nearly equal as possible, but, as with
numerical integration methods, roundoff is generally not significant
compared with truncation, and so we shall ignore it when choosing the free
parameters.


5.8-2 Second-Order Methods 


For the case m = 2 the system (5.8-26) retains only the equations pertaining
to $h_n$ and $h_n^2$. These, together with (5.8-24) for i = 2, are

$$w_1 + w_2 = 1 \qquad \alpha_2 w_2 = \tfrac{1}{2} \qquad \beta_{21} = \alpha_2 \qquad (5.8-36)$$

Three second-order methods of interest correspond to $\alpha_2 = \tfrac{1}{2}, \tfrac{2}{3}, 1$ for which
(5.8-1) becomes, respectively,

$$y_{n+1} - y_n = h_n f(x_n + \tfrac{1}{2} h_n,\; y_n + \tfrac{1}{2} h_n f_n) \qquad (5.8-37)$$

$$y_{n+1} - y_n = \frac{h_n}{4} [f(x_n, y_n) + 3 f(x_n + \tfrac{2}{3} h_n,\; y_n + \tfrac{2}{3} h_n f_n)] \qquad (5.8-38)$$

$$y_{n+1} - y_n = \frac{h_n}{2} [f(x_n, y_n) + f(x_n + h_n,\; y_n + h_n f_n)] \qquad (5.8-39)$$

When f(x, y) is a function of x only, the method (5.8-37) corresponds to the
midpoint rule and (5.8-39) to the trapezoidal rule. Equation (5.8-38) is that
for which the bound on $\gamma_2$ in (5.8-32) is minimized {46}. The bounds for
(5.8-37) to (5.8-39) are, respectively, $\tfrac{1}{2} ML^2$, $\tfrac{1}{3} ML^2$, and $\tfrac{2}{3} ML^2$.
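
The one-parameter family of two-stage methods can be coded once and specialized by the choice of $\alpha_2$. The sketch below is illustrative only: the helper names `rk2_step` and `integrate` are assumptions, and the test problem y' = -y, y(0) = 1 is borrowed from the first example of Sec. 5.7-2. Halving the step should reduce the error at x = 1 by a factor of about 4 for a second-order method.

```python
import math


def rk2_step(f, x, y, h, alpha2):
    """Two-stage step with w2 = 1/(2 alpha2), w1 = 1 - w2, beta21 = alpha2."""
    w2 = 1.0 / (2.0 * alpha2)
    w1 = 1.0 - w2
    k1 = h * f(x, y)
    k2 = h * f(x + alpha2 * h, y + alpha2 * k1)
    return y + w1 * k1 + w2 * k2


def integrate(f, x0, y0, x_end, n, alpha2):
    """Take n equal steps from x0 to x_end."""
    h = (x_end - x0) / n
    x, y = x0, y0
    for _ in range(n):
        y = rk2_step(f, x, y, h, alpha2)
        x += h
    return y


f = lambda x, y: -y
exact = math.exp(-1.0)
ratios = {}
for alpha2 in (0.5, 2.0 / 3.0, 1.0):      # midpoint-type, minimum-bound, trapezoidal-type
    e_coarse = abs(integrate(f, 0.0, 1.0, 1.0, 10, alpha2) - exact)
    e_fine = abs(integrate(f, 0.0, 1.0, 1.0, 20, alpha2) - exact)
    ratios[alpha2] = e_coarse / e_fine    # near 4 for a second-order method
```

For this linear test equation all three choices of $\alpha_2$ produce the same numbers, since each reduces to multiplying $y_n$ by $1 - h + h^2/2$; the methods differ only when f is nonlinear or depends on x.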


§.8-3 Third-Order Methods 
The equations are 


witwr,+w3;=1 @2.w,+43;Ww3=4 12a3w,t+a3w,=4 
%2B32W3=% a2 = Br “3 = B31 + Bap (5.8-40) 
a two-parameter family, which can be written 
wp = be 2 Ble + 48) 
6a, 03 
6a,(a%3 — a2) 
2 — 3a X> # X3 
= ——_—___"_ 5.8-41 
"3 6a3(03 — a2) t2,%; #0 ( 
Bo = %2 #4 
3a203(1 — a2) — 03 
B3, = —— > 
a%2(2 — 302) 


_ a3(%3 — a2) 
Ps2 = %2(2 — 3a) 
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The cases a, = a3 and a, or a; = 0 we leave to a problem {45}. Two third- 
order methods of interest are 


Yn+1— Yn =o a + $k; 


ks = h, ee + Ve Vn + 4k1) (5.8-42) 
ks = hy f (xq + 3hns Ya + 3h) 
Yn+1 — Yn = 6(ky + 4k + ks) 
ky =hy f Xn» Yn) 
ky = yf (q+ Mas Yn + 4k) (5.8-43) 
ks = Daf (n+ Ins Yn — ky + 2k) 
When f(x, y) is a function of x only (5.8-43) is Simpson’s rule. The method 
(5.8-42) is that for which the bound on y, is minimized {46}. For the methods 
(5.8-42) and (5.8-43), the bounds are, respectively, ML and 43ML. 
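As a check on the two-parameter family, the following sketch evaluates (5.8-41) in exact rational arithmetic and verifies the six equations (5.8-40). The function names are ours; `Fraction` is used only so that the comparisons are exact.

```python
from fractions import Fraction as F


def third_order_coefficients(a2, a3):
    """Weights (w1, w2, w3) and (beta21, beta31, beta32) from (5.8-41)."""
    if a2 == a3 or a2 == 0 or a3 == 0 or a2 == F(2, 3):
        raise ValueError("special case excluded from (5.8-41)")
    w1 = 1 + (2 - 3 * (a2 + a3)) / (6 * a2 * a3)
    w2 = (3 * a3 - 2) / (6 * a2 * (a3 - a2))
    w3 = (2 - 3 * a2) / (6 * a3 * (a3 - a2))
    b21 = a2
    b31 = (3 * a2 * a3 * (1 - a2) - a3 ** 2) / (a2 * (2 - 3 * a2))
    b32 = a3 * (a3 - a2) / (a2 * (2 - 3 * a2))
    return (w1, w2, w3), (b21, b31, b32)


def order_conditions_hold(a2, a3):
    """Return True if the coefficients satisfy all of (5.8-40)."""
    (w1, w2, w3), (b21, b31, b32) = third_order_coefficients(a2, a3)
    return (w1 + w2 + w3 == 1
            and a2 * w2 + a3 * w3 == F(1, 2)
            and a2 ** 2 * w2 + a3 ** 2 * w3 == F(1, 3)
            and a2 * b32 * w3 == F(1, 6)
            and b21 == a2
            and b31 + b32 == a3)
```

Choosing $\alpha_2 = \tfrac{1}{2}$, $\alpha_3 = \tfrac{3}{4}$ gives the coefficients of (5.8-42), and $\alpha_2 = \tfrac{1}{2}$, $\alpha_3 = 1$ gives those of (5.8-43).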
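
5.8-4 Fourth-Order Methods

The two-parameter system (5.8-26) can be solved to give


$$\begin{aligned}
w_1 &= \frac{1}{2} + \frac{1 - 2(\alpha_2 + \alpha_3)}{12\alpha_2\alpha_3} &
w_2 &= \frac{2\alpha_3 - 1}{12\alpha_2(\alpha_3 - \alpha_2)(1 - \alpha_2)} \\
w_3 &= \frac{1 - 2\alpha_2}{12\alpha_3(\alpha_3 - \alpha_2)(1 - \alpha_3)} &
w_4 &= \frac{1}{2} + \frac{2(\alpha_2 + \alpha_3) - 3}{12(1 - \alpha_2)(1 - \alpha_3)} \\
\beta_{32} &= \frac{\alpha_3(\alpha_3 - \alpha_2)}{2\alpha_2(1 - 2\alpha_2)} &
\alpha_4 &= 1 \\
\beta_{42} &= \frac{(1 - \alpha_2)[\alpha_2 + \alpha_3 - 1 - (2\alpha_3 - 1)^2]}{2\alpha_2(\alpha_3 - \alpha_2)[6\alpha_2\alpha_3 - 4(\alpha_2 + \alpha_3) + 3]} \\
\beta_{43} &= \frac{(1 - 2\alpha_2)(1 - \alpha_2)(1 - \alpha_3)}{\alpha_3(\alpha_3 - \alpha_2)[6\alpha_2\alpha_3 - 4(\alpha_2 + \alpha_3) + 3]}
\end{aligned} \tag{5.8-44}$$

except when $\alpha_2, \alpha_3 = 0$; $\alpha_2, \alpha_3 = 1$; $\alpha_2 = \alpha_3$; or the denominators of $\beta_{32}$, $\beta_{42}$,
or $\beta_{43}$ vanish. These special cases are considered in {47} and {48}. The most
commonly used fourth-order Runge-Kutta method is one in which
$\alpha_2 = \alpha_3 = \tfrac{1}{2}$, and its equations are

$$\begin{aligned}
y_{n+1} - y_n &= \tfrac{1}{6}(k_1 + 2k_2 + 2k_3 + k_4) \\
k_1 &= h_n f(x_n, y_n) \\
k_2 &= h_n f(x_n + \tfrac{1}{2}h_n,\; y_n + \tfrac{1}{2}k_1) \\
k_3 &= h_n f(x_n + \tfrac{1}{2}h_n,\; y_n + \tfrac{1}{2}k_2) \\
k_4 &= h_n f(x_n + h_n,\; y_n + k_3)
\end{aligned} \tag{5.8-45}$$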




The method for which the error bound on $\gamma_4$ is minimized corresponds to
$\alpha_2 = .4$, $\alpha_3 = \tfrac{7}{8} - \tfrac{3}{16}\sqrt{5}$ and has the equations


$$\begin{aligned}
y_{n+1} - y_n &= .17476028k_1 - .55148066k_2 + 1.20553560k_3 + .17118478k_4 \\
k_1 &= h_n f(x_n, y_n) \\
k_2 &= h_n f(x_n + .4h_n,\; y_n + .4k_1) \\
k_3 &= h_n f(x_n + .45573725h_n,\; y_n + .29697761k_1 + .15875964k_2) \\
k_4 &= h_n f(x_n + h_n,\; y_n + .21810040k_1 - 3.05096516k_2 + 3.83286476k_3)
\end{aligned} \tag{5.8-46}$$
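The error bounds on the methods (5.8-45) and (5.8-46) are, respectively,
$\tfrac{73}{720}ML^4$ and $5.4627 \times 10^{-2}ML^4$. A third method which is useful for error
estimation (see Sec. 5.8-6) has the equations

$$\begin{aligned}
y_{n+1} - y_n &= \tfrac{1}{6}(k_1 + 4k_3 + k_4) \\
k_1 &= h_n f(x_n, y_n) \\
k_2 &= h_n f(x_n + \tfrac{1}{2}h_n,\; y_n + \tfrac{1}{2}k_1) \\
k_3 &= h_n f(x_n + \tfrac{1}{2}h_n,\; y_n + \tfrac{1}{4}k_1 + \tfrac{1}{4}k_2) \\
k_4 &= h_n f(x_n + h_n,\; y_n - k_2 + 2k_3)
\end{aligned} \tag{5.8-47}$$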
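The decimal coefficients in (5.8-46) can be recovered from the closed forms (5.8-44). The sketch below does this in floating point; it assumes, in line with the row-sum relations (5.8-24), that $\beta_{31} = \alpha_3 - \beta_{32}$ and $\beta_{41} = 1 - \beta_{42} - \beta_{43}$. The function name is ours.

```python
import math


def fourth_order_coefficients(a2, a3):
    """Weights and betas of the two-parameter fourth-order family (5.8-44)."""
    q = 6 * a2 * a3 - 4 * (a2 + a3) + 3
    w1 = 0.5 + (1 - 2 * (a2 + a3)) / (12 * a2 * a3)
    w2 = (2 * a3 - 1) / (12 * a2 * (a3 - a2) * (1 - a2))
    w3 = (1 - 2 * a2) / (12 * a3 * (a3 - a2) * (1 - a3))
    w4 = 0.5 + (2 * (a2 + a3) - 3) / (12 * (1 - a2) * (1 - a3))
    b32 = a3 * (a3 - a2) / (2 * a2 * (1 - 2 * a2))
    b42 = (1 - a2) * (a2 + a3 - 1 - (2 * a3 - 1) ** 2) / (2 * a2 * (a3 - a2) * q)
    b43 = (1 - 2 * a2) * (1 - a2) * (1 - a3) / (a3 * (a3 - a2) * q)
    b31 = a3 - b32          # assumed row sum: alpha3 = beta31 + beta32
    b41 = 1 - b42 - b43     # assumed row sum: alpha4 = 1
    return (w1, w2, w3, w4), (b31, b32), (b41, b42, b43)


# Parameters of the minimum-error-bound method (5.8-46).
w, row3, row4 = fourth_order_coefficients(0.4, 7 / 8 - 3 * math.sqrt(5) / 16)
```

The classical choice $\alpha_2 = \alpha_3 = \tfrac{1}{2}$ cannot be passed to this function, since it is one of the excluded special cases ($\alpha_2 = \alpha_3$).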


5.8-5 Higher-Order Methods


For $m > 4$, it turns out that $p_m > m$. Thus, to achieve fifth-order accuracy, six
stages are necessary and for sixth-order, seven is the minimum number
required. For $m \ge 7$, it has been shown that $p_m \ge m + 2$. This does not
detract from the usefulness of higher-order methods, since under proper
conditions of smoothness the use of such methods allows one to take a much
larger step length $h$ and thus cover the interval of integration using fewer
function evaluations. Since the algebra involved in computing the formulas
for such methods is formidable, it is only recently that these methods have
been generated using computerized algebraic-manipulation systems. A wide
variety of such methods exist since there are many free parameters in the
governing equations. One of the considerations in the choice of these
parameters is that of absolute stability, which we discuss below; another is
connected with local error estimation. As an example of such methods, we
give the following equations for a particular fifth-order method which has
certain valuable features considered below.
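$$\begin{aligned}
y_{n+1} - y_n &= \tfrac{1}{24}k_1 + \tfrac{5}{48}k_4 + \tfrac{27}{56}k_5 + \tfrac{125}{336}k_6 \\
k_1 &= h_n f(x_n, y_n) \\
k_2 &= h_n f(x_n + \tfrac{1}{2}h_n,\; y_n + \tfrac{1}{2}k_1) \\
k_3 &= h_n f(x_n + \tfrac{1}{2}h_n,\; y_n + \tfrac{1}{4}k_1 + \tfrac{1}{4}k_2) \\
k_4 &= h_n f(x_n + h_n,\; y_n - k_2 + 2k_3) \\
k_5 &= h_n f(x_n + \tfrac{2}{3}h_n,\; y_n + \tfrac{7}{27}k_1 + \tfrac{10}{27}k_2 + \tfrac{1}{27}k_4) \\
k_6 &= h_n f(x_n + \tfrac{1}{5}h_n,\; y_n + \tfrac{28}{625}k_1 - \tfrac{1}{5}k_2 + \tfrac{546}{625}k_3 + \tfrac{54}{625}k_4 - \tfrac{378}{625}k_5)
\end{aligned} \tag{5.8-48}$$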


5.8-6 Practical Error Estimation 


The error bounds derived in Sec. 5.8-1 are not very practical, since it is very
difficult to find values of $L$ and $M$ for which (5.8-31) holds. Consequently, we
seek other ways to estimate the local truncation error $T_n$, or more specifically
$\gamma_m$, where, we recall from (5.8-27),


$$T_n = \gamma_m h_n^{m+1} + O(h_n^{m+2})$$


One way, similar to that used in adaptive-integration schemes (cf. Sec. 4.11),
is to integrate over two successive intervals with the same step size $h_n$, then
integrate over the double interval with step size $2h_n$ and compare the results.
In the first integration, we have


$$y_{n+1} \approx Y(x_{n+1}) + \gamma_m h_n^{m+1}$$


$$y_{n+2} \approx Y(x_{n+2}) + 2\gamma_m h_n^{m+1}$$


where we have assumed that $\gamma_m$ does not vary much over the interval
$[x_n, x_{n+2}]$ and that we have started with the exact value $Y(x_n)$ at $x_n$. There is
a further assumption that using $y_{n+1}$ in place of $Y(x_{n+1})$ in going from $x_{n+1}$
to $x_{n+2}$ does not affect the form of the local truncation error. If we now
integrate directly from $x_n$ to $x_{n+2}$ with step size $2h_n$ to yield the value $\hat y_{n+2}$,
we have that


$$\hat y_{n+2} \approx Y(x_{n+2}) + \gamma_m(2h_n)^{m+1}$$

Hence $y_{n+2} - \hat y_{n+2} \approx \gamma_m(2 - 2^{m+1})h_n^{m+1}$, so that


$$T_n \approx \gamma_m h_n^{m+1} \approx \frac{\hat y_{n+2} - y_{n+2}}{2^{m+1} - 2} \tag{5.8-49}$$


Thus, if the method requires $p_m$ function evaluations, at the cost of $p_m - 1$
additional function evaluations (because $k_1$ is the same for both step sizes)




every two steps, we can monitor the calculation to see whether the local
truncation error remains acceptable. Furthermore, as we shall see, this
estimate of $T_n$ can be used to determine a good value of the step size $h_{n+2}$.
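The step-doubling estimate (5.8-49) can be sketched as follows for the classical method (5.8-45), for which $m = 4$. The function names are ours, and the sign convention follows the text: $y \approx Y + T$.

```python
def rk4_step(f, x, y, h):
    """One step of the classical fourth-order method (5.8-45)."""
    k1 = h * f(x, y)
    k2 = h * f(x + 0.5 * h, y + 0.5 * k1)
    k3 = h * f(x + 0.5 * h, y + 0.5 * k2)
    k4 = h * f(x + h, y + k3)
    return y + (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0


def step_doubling_estimate(f, x, y, h, m=4):
    """Two steps of size h plus one of size 2h; returns (y_{n+2}, T estimate)."""
    y1 = rk4_step(f, x, y, h)
    y2 = rk4_step(f, x + h, y1, h)
    y2_hat = rk4_step(f, x, y, 2.0 * h)   # in a tuned code k1 would be reused here
    T = (y2_hat - y2) / (2 ** (m + 1) - 2)   # (5.8-49)
    return y2, T
```

For simplicity the sketch recomputes $k_1$ in the double step, so it spends $p_m$ rather than $p_m - 1$ extra evaluations; the estimate itself is unchanged.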

A second method for estimating $T_n$ is to compare the value of $y_{n+1}$ with
a second value $\hat y_{n+1}$ computed using a method of order $m + 1$. For then we
have, assuming again that $y_n = Y(x_n)$,


$$\hat y_{n+1} \approx Y(x_{n+1}) + \gamma_{m+1}h_n^{m+2} + O(h_n^{m+3})$$

so that

$$y_{n+1} - \hat y_{n+1} \approx \gamma_m h_n^{m+1} + O(h_n^{m+2}) \approx T_n \tag{5.8-50}$$


In general, this is a very expensive way to estimate $T_n$ since it requires
$p_m + p_{m+1} - 1$ function evaluations per step. However, by a clever choice of
the parameters $\beta_{ij}$ it is possible, for some values of $m$, to determine a method
of order $m + 1$ in which a method of order $m$ is embedded; that is, $k_i$, $i = 1,
\ldots, p_m$, are the same for both methods. Then

$$\hat y_{n+1} = y_n + \sum_{i=1}^{p_{m+1}} \hat w_i k_i \qquad\text{and}\qquad y_{n+1} = y_n + \sum_{i=1}^{p_m} w_i k_i$$
Thus, at the cost of $p_{m+1} - p_m$ additional function evaluations, we get an
estimate of the local truncation error. The fifth-order method given in
(5.8-48) is an example of this, since the fourth-order method (5.8-47) is
embedded in it. Thus, (5.8-47) is a fourth-order method with the local
truncation-error estimate


$$T_n = \tfrac{1}{8}k_1 + \tfrac{2}{3}k_3 + \tfrac{1}{16}k_4 - \tfrac{27}{56}k_5 - \tfrac{125}{336}k_6 \tag{5.8-51}$$


which results by formally subtracting the first equation in (5.8-48) from the
first equation in (5.8-47).
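A minimal sketch of the embedded pair follows: one call evaluates the six stages of (5.8-48), returns the fourth-order value of (5.8-47), and forms the estimate (5.8-51). The function name is ours, and the coefficients are those of the equations above.

```python
def embedded_step(f, x, y, h):
    """Fourth-order step (5.8-47) with the truncation-error estimate (5.8-51)."""
    k1 = h * f(x, y)
    k2 = h * f(x + h / 2, y + k1 / 2)
    k3 = h * f(x + h / 2, y + k1 / 4 + k2 / 4)
    k4 = h * f(x + h, y - k2 + 2 * k3)
    y4 = y + (k1 + 4 * k3 + k4) / 6          # fourth-order result
    # Two extra stages, needed only for the fifth-order comparison value.
    k5 = h * f(x + 2 * h / 3, y + (7 * k1 + 10 * k2 + k4) / 27)
    k6 = h * f(x + h / 5,
               y + (28 * k1 - 125 * k2 + 546 * k3 + 54 * k4 - 378 * k5) / 625)
    T = k1 / 8 + 2 * k3 / 3 + k4 / 16 - 27 * k5 / 56 - 125 * k6 / 336
    return y4, T
```

The estimate costs two function evaluations beyond the four of the fourth-order method, in agreement with $p_{m+1} - p_m = 6 - 4$.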


5.8-7 Step-Size Strategy


Once we have an estimate of the local truncation error in one or two integra-
tion steps, we can use this information to decide whether to accept or reject
the computed values and to decide on a new value of $h$ to be used either to
repeat the computation or to continue. If the step size is changed, the choice
of the new $h$ should be such that the estimated local truncation error with
the new value will be less than the allowable error over the new step, $[x,
x + h]$, where $x = x_n$ in case the computation is repeated and $x = x_{n+1}$ or
$x = x_{n+2}$ otherwise. The details are as follows.

Let us assume that we are integrating over the interval $[a, b]$, where
$H = b - a$, and that we want the total error at $x = b$ to be less than $\epsilon$. We
now make the assumption that the total error equals the sum of the local
truncation errors over all subintervals. This assumption may not hold when
$|f_y|$ is large, for then small changes in $y$ cause large changes in $f(x, y)$ and
there is a considerable amount of propagated error. In such cases, the best
we can do is to work with a tolerance $\epsilon$ which is smaller than the accuracy we
desire in the hope that the deleterious effect of error propagation will be
controlled in this way. Returning now to our original assumption, we see
that if we integrate over the subinterval $[x_n, x_n + h_n]$, we require that
$|T^{(n)}| < \epsilon h_n/H = \bar\epsilon h_n$, where $\bar\epsilon = \epsilon/H$, and where $T^{(n)}$ is the local truncation
error over $[x_n, x_n + h_n]$. If our estimate of $|T^{(n)}|$ is $< \bar\epsilon h_n$ after one step of
length $h_n$, or if our estimate of $|T^{(n)}| + |T^{(n+1)}|$ is $< 2\bar\epsilon h_n$ after two steps of
total length $2h_n$, we accept the results and continue from $x_{n+1}$ or $x_{n+2}$,
respectively. Otherwise, we must recompute from $x_n$ with a smaller value of
$h_n$. A new value of $h_n$ may also be desirable even when we accept the results
of the computation, as we shall shortly see.
In either case, we have


$$T^{(n)} \approx \gamma_m h_n^{m+1}$$


If $|T^{(n)}| > \bar\epsilon h_n$, we must choose $\hat h_n$, the new value of $h$, so that for the new
local truncation error, $|\hat T^{(n)}| < \bar\epsilon\hat h_n$. Since $\hat T^{(n)} \approx \gamma_m\hat h_n^{m+1}$ and
$\gamma_m \approx T^{(n)}/h_n^{m+1}$, we require that

$$\frac{|T^{(n)}|}{h_n^{m+1}}\,\hat h_n^{m+1} < \bar\epsilon\,\hat h_n$$


from which we deduce that any choice $\hat h_n$ such that


$$\hat h_n < \left(\frac{\bar\epsilon\,h_n^{m+1}}{|T^{(n)}|}\right)^{1/m} = \tilde h$$

will be satisfactory. However, to be on the safe side, we choose $\hat h_n = .8\tilde h$.

Assuming that $\gamma_m$ does not change much as we proceed from one inter-
val to the next, we see that even when we accept the computation at $x_n + h_n$
or $x_n + 2h_n$, we can use the same reasoning as above to choose the next step
length $h_{n+1}$ or $h_{n+2}$ as equal to $.8\tilde h$. However, we must make the proviso that
this new value be neither too large nor too small, both in an absolute sense
and relative to the previous value of $h$. When the proposed value of $h$ is not
within some preset bounds, we take the new value of $h$ as the extreme point
of the permissible interval of values of $h$.
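The strategy of this section can be sketched for a one-step-at-a-time estimate as follows. The error estimate comes from the embedded pair (5.8-47)/(5.8-48) with $m = 4$; the factor .8 is the safety factor of the text, while the growth and shrink limits (`grow`, `shrink`) stand in for the "preset bounds" and are illustrative assumptions of ours, as are the function names.

```python
def embedded_step(f, x, y, h):
    """Fourth-order step (5.8-47) and its error estimate (5.8-51)."""
    k1 = h * f(x, y)
    k2 = h * f(x + h / 2, y + k1 / 2)
    k3 = h * f(x + h / 2, y + k1 / 4 + k2 / 4)
    k4 = h * f(x + h, y - k2 + 2 * k3)
    k5 = h * f(x + 2 * h / 3, y + (7 * k1 + 10 * k2 + k4) / 27)
    k6 = h * f(x + h / 5,
               y + (28 * k1 - 125 * k2 + 546 * k3 + 54 * k4 - 378 * k5) / 625)
    T = k1 / 8 + 2 * k3 / 3 + k4 / 16 - 27 * k5 / 56 - 125 * k6 / 336
    return y + (k1 + 4 * k3 + k4) / 6, T


def propose_step(h, T, eps_bar, m=4, safety=0.8, grow=2.0, shrink=0.1):
    """New step .8 * h * (eps_bar * h / |T|)**(1/m), kept within [shrink*h, grow*h]."""
    if T == 0.0:
        return grow * h
    h_limit = h * (eps_bar * h / abs(T)) ** (1.0 / m)
    return min(grow * h, max(shrink * h, safety * h_limit))


def integrate_adaptive(f, a, b, y0, eps, h0):
    """Integrate from a to b, accepting a step only if |T| < (eps / (b - a)) * h."""
    eps_bar = eps / (b - a)
    x, y, h = a, y0, h0
    while b - x > 1e-12 * (b - a):
        h = min(h, b - x)                 # do not step past b
        y_new, T = embedded_step(f, x, y, h)
        if abs(T) < eps_bar * h:          # accept; otherwise repeat from x
            x, y = x + h, y_new
        h = propose_step(h, T, eps_bar)
    return y
```

A rejected step always yields a smaller proposal, since then $\bar\epsilon h_n/|T^{(n)}| \le 1$, so the loop cannot stall at a fixed $x$.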


5.8-8 Stability


Since the Runge-Kutta methods are single-step methods, it is meaningless to
apply the definition of relative stability to them since there is only one root
to the (generally) nonlinear difference equation (5.8-1). However, we can
define absolute stability for these methods in terms of equation (5.4-1) with
$K > 0$. We know that the solution of this equation goes to zero as $x \to \infty$.
Hence, it is only natural to require that the numerical solution should also
have this property. Assume a constant step size $h$. We can then define a
Runge-Kutta method to be absolutely stable for a certain value $\bar h = hK$,
$K > 0$, if the numerical solution using this particular method applied to
(5.4-1) goes to zero as $n \to \infty$.


Example 5.7 Apply the Runge-Kutta method of order 2, (5.8-39), to (5.4-1).
We obtain


$$y_{n+1} = y_n + \frac{h}{2}[-Ky_n - K(y_n - hKy_n)] = \left(1 - hK + \frac{h^2}{2}K^2\right)y_n$$


so that, by induction, $y_{n+1} = [1 - hK + (h^2/2)K^2]^{n+1}y_0$. In order that $y_n \to 0$ as $n \to \infty$
for $y_0 \ne 0$, it is necessary and sufficient that $|1 - \bar h + (\bar h^2/2)| < 1$, from which it follows
that, for $0 < \bar h < 2$, the method (5.8-39) is absolutely stable. In fact, for any second-order
method using two function evaluations, the interval of absolute stability is (0, 2).


In general, for a $p$-stage method of order $m$, we have that {52}
$y_{n+1} = ry_n$, where

$$r = 1 - \bar h + \frac{1}{2}\bar h^2 + \cdots + \frac{1}{m!}(-\bar h)^m + \sum_{i=m+1}^{p}\gamma_i(-\bar h)^i \tag{5.8-52}$$

and the coefficients $\gamma_i$ depend on the constants of the particular method. If
$p = m$, the value of the sum is zero, so that for all methods of order $m$ for
which $p = m$, the interval of stability is the same. Thus for $m = 3$, the interval
is (0, 2.51), and for $m = 4$, it is (0, 2.78). For $m > 4$, $p > m$, and one of the
criteria in choosing the constants of such a method is to maximize the
interval of absolute stability.
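The intervals quoted above can be recovered numerically from (5.8-52) with $p = m$, where $r$ is the truncated exponential series. The sketch below scans upward in $\bar h$ until $|r| \ge 1$ and then bisects; the function names, the scan increment, and the tolerance are our own choices.

```python
from math import factorial


def amplification(hbar, m):
    """r in (5.8-52) for p = m: the Taylor polynomial of exp(-hbar) of degree m."""
    return sum((-hbar) ** j / factorial(j) for j in range(m + 1))


def stability_limit(m, step=0.01, tol=1e-10):
    """Right end point of the interval of absolute stability (0, limit)."""
    lo = step
    while abs(amplification(lo + step, m)) < 1.0:   # coarse scan
        lo += step
    hi = lo + step
    while hi - lo > tol:                            # bisection on |r| = 1
        mid = 0.5 * (lo + hi)
        if abs(amplification(mid, m)) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For $m = 2, 3, 4$ this gives limits of about 2, 2.5127, and 2.7853, which the text quotes truncated to (0, 2), (0, 2.51), and (0, 2.78).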

For the general equation $y' = f(x, y)$, we approximate $K$ by $-f_y(x, y)$ at
some point in the interval $(x, x + h)$. For a single equation, we can approxi-
mate $f_y$ quite easily provided two different values of $k$ use the same value of $\alpha$.
Thus if


$$k_r = h_n f\Big(x_n + \alpha h_n,\; y_n + \sum_{j=1}^{r-1}\beta_{rj}k_j\Big)$$

and

$$k_s = h_n f\Big(x_n + \alpha h_n,\; y_n + \sum_{j=1}^{s-1}\beta_{sj}k_j\Big)$$

then

$$K \approx -f_y(x_n + \alpha h_n, \bar y_n) \approx -\frac{k_s - k_r}{h_n\Big(\sum_{j=1}^{s-1}\beta_{sj}k_j - \sum_{j=1}^{r-1}\beta_{rj}k_j\Big)}$$


where $\bar y_n$ is some intermediate value. We see from this that another criterion
in choosing the free parameters of a method is that $\alpha_r = \alpha_s$ for some pair of
indices $0 < r < s \le p$. This occurs, for example, in the classical fourth-order
method (5.8-45) as well as in the fifth-order method (5.8-48).
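For the classical method (5.8-45) the pair $r = 2$, $s = 3$ shares $\alpha = \tfrac{1}{2}$, and the two $y$ arguments differ by $\tfrac{1}{2}(k_2 - k_1)$. A minimal sketch of the resulting estimate of $K$ is given below; the function name is ours, and returning `None` when $k_2 = k_1$ is an illustrative choice.

```python
def rk4_step_with_K(f, x, y, h):
    """Classical RK4 step (5.8-45) that also returns an estimate of K = -f_y."""
    k1 = h * f(x, y)
    k2 = h * f(x + 0.5 * h, y + 0.5 * k1)
    k3 = h * f(x + 0.5 * h, y + 0.5 * k2)
    k4 = h * f(x + h, y + k3)
    y_next = y + (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    dy = 0.5 * k2 - 0.5 * k1          # difference of the y arguments of k3 and k2
    K = None if dy == 0.0 else -(k3 - k2) / (h * dy)
    return y_next, K
```

The product `h * K` can then be compared with the stability limit 2.78 of the fourth-order method at no cost in function evaluations. When $f$ is linear in $y$ the estimate is exact apart from roundoff.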

We stress that this estimate can only be obtained for a single equation.
However, in general, the problem of stability does not arise in the integra-
tion of a single equation since accuracy considerations usually demand a
choice of step size $h$ which keeps the method stable. The problems arise in
systems of equations. Here, the test system has the form


$$\mathbf{y}' = A\mathbf{y} \qquad \mathbf{y}(x_0) = \mathbf{y}_0$$


where $\mathbf{y}$ is an $n$ vector and $A$ an $n \times n$ matrix. The situation corresponding
to $K > 0$ for the single equation (5.4-1) is that all eigenvalues $\lambda_i$ of $A$ have
negative real part. A method is then absolutely stable for all values of $h$ such
that $\bar h_i = |h\lambda_i| < 1$ when $\mathrm{Re}\,\lambda_i < 0$ for all $i$. We shall return to this in
Sec. 5.10 on stiff equations.


5.8-9 Comparison of Runge-Kutta and Predictor-Corrector Methods 


In order to compare Runge-Kutta with corresponding-order predictor- 
corrector methods, we note the following points: 


1. Runge-Kutta methods are self-starting, the interval between steps may be 
changed at will, and, in general, they are particularly straightforward to 
apply on a digital computer. 

2. They are comparable in accuracy to (and often more accurate than)
corresponding-order predictor-corrector methods {49}. However, if we do
not monitor the per-step error by using additional function evaluations,
we shall generally be required to choose the step size h conservatively, i.e.,
smaller than is actually necessary to achieve the desired accuracy.

3. Further, they require a number of evaluations of f(x, y) at each step at 
least equal to the order of the method. As we have seen, predictor- 
corrector methods generally require only two evaluations per step. Since 
evaluation of f(x, y) is usually the most time-consuming part of solving 
(5.1-1), this means that predictor-corrector methods are generally faster
than Runge-Kutta methods; e.g., fourth-order predictor-corrector
methods are nearly twice as fast as fourth-order Runge-Kutta methods.

4. Finally, monitoring the local truncation error does not involve any addi- 
tional function evaluations using predictor-corrector methods, whereas it 
is quite expensive for Runge-Kutta methods. 


On a digital computer reasons 2 to 4 are much more compelling than 1,
and thus predictor-corrector methods are the indicated methods to use,
except when f(x, y) is simple to compute, in which case point 1 becomes
important.

The self-starting characteristic of Runge-Kutta methods makes them an 
ideal adjunct to the usual predictor-corrector methods for starting the 
solution. Since they will be used for only a few steps of the computation, 
truncation error and not stability is the key consideration. Therefore, for this 
purpose, the minimum-error-bound Runge-Kutta methods should be used. 
The methods we have derived may be used for systems of equations {50},
although the error bounds were derived for single equations. It is reasonable 
to assume, however, that methods which are best in this sense for single 
equations will be at least nearly best for systems. 


5.9 OTHER NUMERICAL INTEGRATION METHODS 


5.9-1 Methods Based on Higher Derivatives 


In this section we shall use derivatives higher than the first in formulas for
the solution of first-order differential equations. One example of such a
formula is readily derivable from the Euler-Maclaurin sum formula in the
form (4.13-17). Replacing $f(x)$ with $y'(x)$, we get, after making some simple
changes in notation,

$$y_{n+1} = y_1 + h\sum_{i=-1}^{n-1} y'_{n-i} - \frac{h}{2}(y'_{n+1} + y'_1) - \sum_{k=1}^{m}\frac{B_{2k}h^{2k}}{(2k)!}\left(y^{(2k)}_{n+1} - y^{(2k)}_1\right) \tag{5.9-1}$$

with the truncation error given by (4.13-16)

$$T = -\frac{nh^{2m+3}B_{2m+2}}{(2m+2)!}\,y^{(2m+3)}(\eta) \tag{5.9-2}$$

Our main interest in this section, however, is in a class of predictor and
corrector formulas which use the second as well as the first derivative. The
corrector formulas are generated by taking the Hermite interpolation for-
mula (3.7-17) (or the modified Hermite formula) based on the points $x_{n+1}$,
$x_n, \ldots, x_{n-p}$ and integrating between $x_n$ and $x_{n+1}$ after replacing $f(x)$ by
$y'(x)$. The result of this is


$$y_{n+1} = y_n + \sum_{i=-1}^{p}\left[\int_{x_n}^{x_{n+1}} h_{n-i}(x)\,dx\right]y'_{n-i} + \sum_{i=-1}^{p}\left[\int_{x_n}^{x_{n+1}} \bar h_{n-i}(x)\,dx\right]y''_{n-i} \tag{5.9-3}$$

The truncation error is given, using (4.4-12), as


$$T_n = \frac{y^{(2p+5)}(\eta)}{(2p+4)!}\int_{x_n}^{x_{n+1}}[(x - x_{n+1})\cdots(x - x_{n-p})]^2\,dx \tag{5.9-4}$$


With $p = 0$ in (5.9-3), we get the fourth-order corrector


$$y_{n+1} = y_n + \frac{h}{2}(y'_{n+1} + y'_n) + \frac{h^2}{12}(-y''_{n+1} + y''_n) \tag{5.9-5}$$

which has an error term


$$T_n = \frac{h^5}{720}\,y^{(5)}(\eta) \tag{5.9-6}$$

The stability equation for (5.9-3) has only one root, $r_0 = 1$ at $h = 0$.
Therefore, for some range of values of $Kh$, this class of methods is stable. In
particular, (5.9-5) is stable for all values of $Kh$ {54}. Moreover, note that the
truncation-error term (5.9-6) is substantially smaller than that given by any
of the correctors in Table 5.2. Thus, if the second derivative of $y$ can be
calculated easily, (5.9-5) may indeed be a good choice for a corrector.
Indeed, when (5.1-1) is a single equation, the second derivative is given by


$$y'' = \frac{d}{dx}f(x, y) = \frac{\partial f}{\partial x} + f\,\frac{\partial f}{\partial y} \tag{5.9-7}$$


It is often true that, having computed f(x, y), the calculation of the partial 
derivatives of f(x, y) can be done very quickly on a digital computer, in 
which case the second derivative as given by (5.9-7) is indeed easily calcul- 
able. Of course, for systems of equations, the calculation of the second 
derivative becomes substantially more complicated. 

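To get a class of predictors, we use a Hermite formula based on the
points $x_n, \ldots, x_{n-p}$ and proceed as above {53}. A convenient predictor to use
with (5.9-5) is


$$y_{n+1} = y_n + \frac{h}{2}(-y'_n + 3y'_{n-1}) + \frac{h^2}{12}(17y''_n + 7y''_{n-1}) \tag{5.9-8}$$

which has a truncation error

$$T_n = \frac{31h^5}{720}\,y^{(5)}(\eta) \tag{5.9-9}$$


The set of equations analogous to (5.5-27) and (5.5-33) is


$$\begin{aligned}
\text{Predictor:}\quad & y^{(0)}_{n+1} = y_n + \frac{h}{2}(-y'_n + 3y'_{n-1}) + \frac{h^2}{12}(17y''_n + 7y''_{n-1}) \\
\text{Modifier:}\quad & \bar y^{(0)}_{n+1} = y^{(0)}_{n+1} + \tfrac{31}{30}\big(y_n - y^{(0)}_n\big) \\
& \big(\bar y^{(0)}_{n+1}\big)' = f\big(x_{n+1}, \bar y^{(0)}_{n+1}\big) \\
\text{Corrector:}\quad & y^{(j+1)}_{n+1} = y_n + \frac{h}{2}\Big[\big(y^{(j)}_{n+1}\big)' + y'_n\Big] + \frac{h^2}{12}\Big[-\big(y^{(j)}_{n+1}\big)'' + y''_n\Big] \qquad j = 0, 1, \ldots
\end{aligned} \tag{5.9-10}$$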
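A minimal sketch of the predictor-modifier-corrector cycle (5.9-10) with a single application of the corrector follows. It assumes that the caller supplies $f$, a function `d2y` returning $y''$ as in (5.9-7), and one starting value $y_1$ from some other source; the modifier is skipped on the first step, when no previous predictor value exists. All names are ours.

```python
def hermite_pmc(f, d2y, x0, y0, y1, h, n_steps):
    """Advance n_steps from the starting values y0 = y(x0), y1 = y(x0 + h).

    Returns the list [y0, y1, y2, ...] of computed values.
    """
    ys = [y0, y1]
    diff = 0.0                       # y_n - y_n^(0); zero until a predictor exists
    for n in range(1, n_steps + 1):
        xn, yn, ym = x0 + n * h, ys[n], ys[n - 1]
        d1n, d1m = f(xn, yn), f(xn - h, ym)
        d2n, d2m = d2y(xn, yn), d2y(xn - h, ym)
        # Predictor (5.9-8)
        p = yn + h / 2 * (-d1n + 3 * d1m) + h * h / 12 * (17 * d2n + 7 * d2m)
        # Modifier: add the estimated predictor error, (31/30)(corrector - predictor)
        mod = p + 31.0 / 30.0 * diff
        # Corrector (5.9-5), applied once with derivatives evaluated at mod
        xp = xn + h
        c = yn + h / 2 * (f(xp, mod) + d1n) + h * h / 12 * (-d2y(xp, mod) + d2n)
        diff = c - p
        ys.append(c)
    return ys
```

For $y' = -y$ one has $y'' = y$, so `d2y` is simply `lambda x, y: y`; this is the kind of problem for which the second derivative costs almost nothing.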
Example 5.8 Use (5.9-10) to perform the same calculations as in Sec. 5.7-2 on the same three
examples.

The results for these calculations corresponding to those of Sec. 5.7-2 are given in the
last two columns of Tables 5.3 to 5.6. Since the Hermite method is stable and has a smaller
truncation error than either Hamming's or Milne's methods, we would expect it to give
better results than Hamming's method for the differential equation (5.7-13) and better
results than both Milne's and Hamming's methods for Eq. (5.7-16), and in fact it does. The
importance of roundoff is again in evidence in the solution of the differential equation
(5.7-17), where the Hermite method is initially best, but by the end of the computation,
the errors in all three methods are similar in magnitude. In all the computations of this
section, the same convergence factors were used as in the examples of Sec. 5.7-2, and in all
cases one application of the corrector at each step was sufficient. Thus, when the second
derivative is easy to calculate (it is very easy for the first two examples considered here and
not quite so easy for the last), the Hermite method is to be recommended over Hamming's
method.


5.9-2 Extrapolation Methods 


Consider the Euler, or point-slope, method for solving a differential equa-
tion. It does not require any additional starting values, and the solution is
given by the recursive formula


$$y_{n+1} = y_n + hf(x_n, y_n) \qquad y_0 = Y(x_0) \qquad n = 0, 1, \ldots, N \tag{5.9-11}$$


The sequence $\{y_n\}$ depends on the step size $h$, so that we can denote the
computed solution by $y(x; h)$; the domain of $y(x; h)$ is the set $\{x_n\}$, where
$x_n = x_0 + nh$, $n = 0, 1, \ldots, N$. For this function $y(x; h)$ it can be shown that


$$y(x; h) = Y(x) + c_1(x)h + c_2(x)h^2 + \cdots + c_p(x)h^p + O(h^{p+1}) \tag{5.9-12}$$


To apply Richardson extrapolation (Sec. 4.2) to this problem, we choose a
sequence of pairs $(h_i, N_i)$ such that $x = x_0 + h_iN_i$ is a constant, calculate
$y(x; h_i)$ for each pair, and apply formula (4.2-10) with $\gamma = 1$. We shall not
pursue this case further since there exists a more efficient integration method
yielding an expansion like the above but containing only even powers of $h$.
This gives better results using extrapolation. The method is based on the
midpoint formula (5.5-16). Used by itself, the midpoint formula requires two
starting values and also exhibits an oscillating error due to the fact that $y_{n+1}$
is coupled to $y_{n-1}$ and not to the immediately preceding value $y_n$, which
only enters via the derivative. On the other hand, it is more accurate than the
Euler method in that the local truncation error is $O(h^3)$ and the global
error is $O(h^2)$, while in the Euler method, the errors are $O(h^2)$ and $O(h)$,
respectively [cf. (5.4-32)]. In order to overcome the oscillating error, we
introduce a damping process to get the modified midpoint method of Gragg.
Set


$$\begin{aligned}
y_1 &= y_0 + hy'_0 \\
y_{n+1} &= y_{n-1} + 2hy'_n \qquad n = 1, 2, \ldots, N;\ N \text{ even} \\
\hat y_N &= \tfrac{1}{4}(y_{N-1} + 2y_N + y_{N+1})
\end{aligned} \tag{5.9-13}$$




If we set $x = x_0 + Nh$ and $y(x; h) = \hat y_N$, it can be shown that

$$y(x; h) = Y(x) + c_1(x)h^2 + c_2(x)h^4 + \cdots \tag{5.9-14}$$

Hence, the convergence of Richardson extrapolation should be quicker if we
calculate $y(x; h_i)$ for a sequence of pairs $(h_i, N_i)$, $x = x_0 + h_iN_i$, and now
apply (4.2-10) with $\gamma = 2$.


Example 5.9 Apply (5.9-13) and Richardson extrapolation to the differential equation


$$y' = y^2 \qquad y(0) = .25$$

Starting with $h_0 = \tfrac{1}{2}$ and setting $h_i = \tfrac{1}{2}h_{i-1}$, we calculate with $N_i = 2^{i+1}$:


   h       N    y(1; h) = T_0    T_1            T_2
  .5       2    .33225 2741
                                 .33331 4614
  .25      4    .33304 9146                     .33333 3213
                                 .33333 2051
  .125     8    .33326 1325


The exact solution of the differential equation is $Y(x) = 1/(4 - x)$, so that $Y(1) = \tfrac{1}{3}$. The
extrapolated solution is correct to six figures, and the error is $\tfrac{1}{600}$ times the error in
$y(1; .125)$.
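The table of Example 5.9 can be regenerated with a short sketch of (5.9-13) followed by Richardson extrapolation in $h^2$. We assume here that (4.2-10) with $\gamma = 2$ takes the usual Romberg form $T_k = (4^kT_{k-1}^{\text{fine}} - T_{k-1}^{\text{coarse}})/(4^k - 1)$ for step halving; the function names are ours.

```python
def modified_midpoint(f, x0, y0, x_end, N):
    """Gragg's modified midpoint method (5.9-13); returns the damped value at x_end."""
    if N % 2 != 0:
        raise ValueError("N must be even")
    h = (x_end - x0) / N
    y_prev, y_cur = y0, y0 + h * f(x0, y0)        # y_0 and y_1
    y_before = y0
    for n in range(1, N + 1):
        y_next = y_prev + 2.0 * h * f(x0 + n * h, y_cur)   # y_{n+1}
        y_before = y_prev                                   # becomes y_{N-1}
        y_prev, y_cur = y_cur, y_next                       # y_N and y_{N+1}
    return 0.25 * (y_before + 2.0 * y_prev + y_cur)


def richardson_table(f, x0, y0, x_end, levels):
    """Row i holds T_0, ..., T_i computed with N_i = 2**(i + 1) subintervals."""
    rows = []
    for i in range(levels):
        row = [modified_midpoint(f, x0, y0, x_end, 2 ** (i + 1))]
        for k in range(1, i + 1):
            row.append((4 ** k * row[k - 1] - rows[i - 1][k - 1]) / (4 ** k - 1))
        rows.append(row)
    return rows


table = richardson_table(lambda x, y: y * y, 0.0, 0.25, 1.0, 3)
```

Here `table[2][2]` is the doubly extrapolated value in the last column of the table above.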


Two general strategies suggest themselves in using extrapolation
methods. Suppose we wish the solution at a set of points $x_0, x_1, x_2, \ldots$ such
that $x_{n+1} - x_n = H$ and we use the modified midpoint method. Then to
proceed from $x_n$ to $x_{n+1}$, we take a succession of values of $h$, which might be
the usual Romberg sequence $H/2, H/4, H/8, \ldots$ (as in the example above) or
a sequence such as $H/2, H/4, H/6, H/8, H/12, \ldots$, which, as we noted at the
end of Section 4.10, is more efficient for quadrature. In any case, $N = H/h$
must be even. We then apply (4.2-10), as in the example above, until we have
a satisfactory extrapolated value at $x_{n+1}$. Then we proceed to $x_{n+2}$ with the
accepted value at $x_{n+1}$ as initial value. This so-called active extrapolation is
a single-step method as defined in Sec. 5.1.

Another possibility is passive extrapolation. This can best be il-
lustrated using the trapezoidal rule


$$y_{n+1} = y_n + \frac{h}{2}(y'_n + y'_{n+1}) \tag{5.9-15}$$


If we assume that $y_{n+1}$ satisfies this equation, i.e., that we have iterated to
convergence, and if we integrate our differential equation over the entire
integration interval to yield the function $y(x; h)$ defined on the set $\{x_n\}$ as
previously, then the error at every grid point $x_n$, $n = 1, 2, \ldots, N$, satisfies
(5.9-14). Thus, after integrating over the entire interval with various values of
$h_i$, $h_i = H/2^i$, $i = 0, 1, 2, \ldots$, or some other sequence $h_i$ with $h_0 = H$, where $H$
is the spacing we are interested in, we can extrapolate at each of the basic
grid points $x_j = x_0 + jH$. While this method gives very accurate results for
difficult problems, it is very costly in time because of the requirement that we
iterate to convergence even when the value of $h_i$ is not small enough to
ensure rapid convergence of the iterations (5.5-3) or alternatively use a
different method to solve the nonlinear equation (5.9-15).


5.10 STIFF EQUATIONS 


Although the concept of stiffness is usually associated with systems of first- 
order differential equations, we shall first illustrate the problem with a single 
equation. We shall then discuss the general situation and one of the many 
proposed solutions to the problem. 

Consider the differential equation


$$y' = 100(\sin x - y) \qquad y(0) = 0 \tag{5.10-1}$$

The exact solution is

$$y(x) = \frac{\sin x - .01\cos x + .01e^{-100x}}{1.0001} \tag{5.10-2}$$


Since the exponential term is less than $10^{-6}$ for $x \ge .1$, one would expect
that a step size of $h = .1$ would be possible for $x > .1$. However, if we com-
pute with $h = .03$ using the Runge-Kutta method (5.8-45), we find that
$y(3) = 6.7 \times 10^{11}$, whereas with $h = .025$ we get $y(3) = .150943$, which is a
good result. The problem here is that while the component $.01e^{-100x}$ does
not contribute anything to the solution after a short interval, i.e., it behaves
like a transient, nevertheless, it influences the stability interval throughout
the computation. Since the stability interval is measured in units of
$\bar h = h|\partial f/\partial y|$ and $\partial f/\partial y = -100$, we see that we need a very small value of $h$
to keep $\bar h$ reasonable. For the method given by Eqs. (5.8-45) $\bar h$ must be in the
interval (0, 2.78) for absolute stability. This translates into an interval
(0, .0278) for $h$, which explains why the value $h = .03$ causes instability while
$h = .025$ still gives a stable computation.
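The experiment just described is easy to repeat. The sketch below applies the classical method (5.8-45) to (5.10-1) with the two step sizes of the text; the function and variable names are ours.

```python
import math


def rk4_fixed(f, x0, y0, h, n_steps):
    """Classical RK4 (5.8-45) with constant step h; returns y after n_steps."""
    y = y0
    for i in range(n_steps):
        x = x0 + i * h
        k1 = h * f(x, y)
        k2 = h * f(x + 0.5 * h, y + 0.5 * k1)
        k3 = h * f(x + 0.5 * h, y + 0.5 * k2)
        k4 = h * f(x + h, y + k3)
        y += (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return y


def f(x, y):
    """Right-hand side of (5.10-1)."""
    return 100.0 * (math.sin(x) - y)


y_h030 = rk4_fixed(f, 0.0, 0.0, 0.03, 100)    # hbar = 3.0, outside (0, 2.78)
y_h025 = rk4_fixed(f, 0.0, 0.0, 0.025, 120)   # hbar = 2.5, inside (0, 2.78)
```

With $\bar h = 3$ each step multiplies the transient component by $1 - 3 + 4.5 - 4.5 + 3.375 = 1.375$, which over 100 steps accounts for the enormous value of `y_h030`.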
The situation for systems is similar. For the system


$$y^{(i)\prime} = f_i(x;\, y^{(1)}, \ldots, y^{(m)}) \qquad i = 1, \ldots, m \tag{5.10-3}$$


or equivalently $\mathbf{y}' = \mathbf{f}(x, \mathbf{y})$ (5.10-4)
where $\mathbf{y} = [y^{(1)}, \ldots, y^{(m)}]^T$ (5.10-5)
the stability interval is measured in units of $\bar h = h\rho(J)$, where $J$ is the Jacobian matrix


$$J = \left[\frac{\partial f_i}{\partial y^{(j)}}\right] \qquad i, j = 1, \ldots, m \tag{5.10-6}$$


and $\rho(J)$ is the spectral radius of $J$, which equals the maximum of the
magnitudes of the eigenvalues of $J$ (see Sec. 1.3). Hence, if $\rho(J)$ is large, $h$ will
have to be very small to ensure absolute stability.

In order to determine when absolute stability is essential, we consider a
special case of (5.10-4), namely the system of inhomogeneous linear equa-
tions with constant coefficients


$$\mathbf{y}' = A\mathbf{y} + \boldsymbol{\phi}(x) \tag{5.10-7}$$


where $\boldsymbol{\phi}(x)$ is a vector-valued function of the independent variable $x$ and we
assume that the $m \times m$ matrix $A$ has distinct eigenvalues. In this case, the
system (5.10-7) has a general solution of the form


$$\mathbf{y}(x) = \sum_{k=1}^{m} c_k e^{\lambda_k x}\mathbf{z}_k + \boldsymbol{\psi}(x) \tag{5.10-8}$$


where the $\lambda_k$, $k = 1, \ldots, m$, are the (distinct) eigenvalues of $A$ and the $\mathbf{z}_k$ are
the corresponding eigenvectors. The $c_k$ depend on the initial values of the
system (5.10-7). Now, if $\mathrm{Re}\,\lambda_k > 0$ for some $k$, then generally $\mathbf{y}(x) \to \infty$ as
$x \to \infty$, so that absolute stability is not relevant. However, if $\mathrm{Re}\,\lambda_k < 0$ for all
$k$, then the term


$$\sum_{k=1}^{m} c_k e^{\lambda_k x}\mathbf{z}_k \to 0 \qquad \text{as } x \to \infty \tag{5.10-9}$$

so that absolute stability becomes important. (The case $\mathrm{Re}\,\lambda_k = 0$ corre-
sponds to an oscillating term. If $|\lambda_k| \gg 0$, the oscillation is very rapid; this
case is the subject of much current research activity. We shall not treat the
oscillating case here.) When (5.10-9) holds, the first term in the solution
(5.10-8) is called the transient solution and the second term $\boldsymbol{\psi}(x)$ is called the
steady-state solution.

Let $\lambda_\mu$ and $\lambda_\nu$ be two eigenvalues of $A$ such that


$$|\mathrm{Re}\,\lambda_\mu| \le |\mathrm{Re}\,\lambda_k| \le |\mathrm{Re}\,\lambda_\nu| \qquad k = 1, \ldots, m$$


If we are interested in finding the steady-state solution $\boldsymbol{\psi}(x)$, we must inte-
grate (5.10-7) until the slowest-decaying exponential in the solution, $e^{\lambda_\mu x}$, is
negligible. (Recall that $\mathrm{Re}\,\lambda_k < 0$ for all $k$.) Thus, the smaller $|\mathrm{Re}\,\lambda_\mu|$ is, the
larger the integration interval will be. On the other hand, because of stability
considerations, the size of the integration step is determined by $\rho(A)$, which
is greater than or equal to $|\mathrm{Re}\,\lambda_\nu|$. Hence, if $|\mathrm{Re}\,\lambda_\nu| \gg |\mathrm{Re}\,\lambda_\mu|$, we have to
take an excessively large number of steps to cover the integration interval of
interest in order to find the steady-state solution. This situation is called
stiffness. We summarize the above discussion with the following definition.
Definition 5.3 The linear system $\mathbf{y}' = A\mathbf{y} + \boldsymbol{\phi}(x)$ is said to be stiff if


1. $\mathrm{Re}\,\lambda_k < 0$, $k = 1, \ldots, m$.
2. $\max_{1 \le k \le m}|\mathrm{Re}\,\lambda_k| \gg \min_{1 \le k \le m}|\mathrm{Re}\,\lambda_k|$


where the $\lambda_k$, $k = 1, \ldots, m$, are the eigenvalues of $A$. The ratio


$$\frac{\max_{1 \le k \le m}|\mathrm{Re}\,\lambda_k|}{\min_{1 \le k \le m}|\mathrm{Re}\,\lambda_k|}$$


is called the stiffness ratio.


For the nonlinear system $\mathbf{y}' = \mathbf{f}(x, \mathbf{y})$ stiffness is determined by the
eigenvalue structure of the Jacobian $J$, so that it depends on the solution $\mathbf{y}$
and ultimately on the independent variable $x$. We say that the system (5.10-4)
is stiff in an interval $I$ if for $x \in I$ the eigenvalues of the Jacobian $J$ satisfy
conditions 1 and 2 in Definition 5.3.

If the system (5.10-4) is a model of a physical system, stiffness means that
there are processes in the physical system described by the differential equa-
tions with significantly different time scales (or time constants). The usual
numerical methods are inadequate since they require that we take account of
the fastest process even after the corresponding component has died out in
the exact solution. The only satisfactory methods are those which are abso-
lutely stable in a good portion of $\mathcal{N}$, that part of the complex plane with
negative real part.

The trapezoidal rule


$$\mathbf{y}_{n+1} = \mathbf{y}_n + \frac{h}{2}[\mathbf{f}(x_n, \mathbf{y}_n) + \mathbf{f}(x_{n+1}, \mathbf{y}_{n+1})] \tag{5.10-10}$$


is absolutely stable in all of $\mathcal{N}$. However, in applying it, we must make sure
that this equation is satisfied exactly to ensure stability; i.e., we must iterate
to convergence. However, if we use a relatively large $h$, and this is the whole
point of using a rule absolutely stable in $\mathcal{N}$, we cannot expect convergence if
we use the iteration (5.5-3) since for systems of equations this requires that
$\tfrac{1}{2}h\rho(J) < 1$. Thus, viewing (5.10-10) as a nonlinear system of equations for
$y^{(1)}_{n+1}, \ldots, y^{(m)}_{n+1}$, we must use the Newton-Raphson method or one of its
variants to solve (5.10-10). As this implies, the problems involved in solving
stiff systems are quite formidable.

A family of methods proposed by Gear has proved successful for stiff
equations in a great number of cases. These methods are based on the
implicit formulas (5.2-15) and (5.2-16) derived from numerical differentia-
tion. Their properties are such that they are stable for all values of $\bar h$ such
that $\mathrm{Re}\,\bar h < -a$ for some relatively small positive value of $a$ and accurate for
values of $\bar h$ in the rectangle $-a < x < a$, $-b < y < b$, $b$ some small positive
value. Thus for values of $\bar h$ where stability is the prime consideration, that is,
$\mathrm{Re}\,\bar h < -a$, these methods are stable, and where accuracy is more impor-
tant, that is, $\mathrm{Re}\,\bar h > -a$, they are accurate. Of course, whenever we try to
achieve higher accuracy, stability deteriorates, but we should expect this.
Gear's method can be applied in the fixed-step mode (5.2-16) and in the
variable-step-variable-order mode described in Sec. 5.7-1. The order must
be limited to 5 because of stability considerations. The actual implementation
of these methods is quite complicated, but the results attained justify the
effort invested. Details can be found in the references given in the Bibliogra-
phic Notes, where references to other methods for solving stiff systems can
also be found.


Example 5.10 Solve the following nonlinear stiff system, which arises in a problem of
reaction kinetics:


$$\begin{aligned}
u' &= .01 - (.01 + u + v)[1 + (u + 1000)(u + 1)] = s(u, v) \\
v' &= .01 - (.01 + u + v)(1 + v^2) = t(u, v) \\
u(0) &= v(0) = 0 \qquad 0 \le x \le 100
\end{aligned} \tag{5.10-11}$$


The eigenvalues of the Jacobian of this system depend on $x$. At $x = 0$ they are $-1012$
and $-.01$, while at $x = 100$ they are $-21.7$ and $-.089$, so that the system is initially very
stiff (stiffness ratio $\approx 10^5$) but much less so for large $x$ (stiffness ratio $\approx 245$).
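The eigenvalues quoted at $x = 0$ can be checked directly from the partial derivatives of $s$ and $t$ in (5.10-11). The following sketch forms the $2 \times 2$ Jacobian at a given point and returns its eigenvalues, assuming they are real; the function names are ours.

```python
import math


def jacobian(u, v):
    """Jacobian [[ds/du, ds/dv], [dt/du, dt/dv]] of the system (5.10-11)."""
    g = 0.01 + u + v
    r = 1.0 + (u + 1000.0) * (u + 1.0)
    return ((-(r + g * (2.0 * u + 1001.0)), -r),
            (-(1.0 + v * v), -((1.0 + v * v) + 2.0 * v * g)))


def real_eigenvalues_2x2(a):
    """Eigenvalues of a 2 x 2 matrix with real spectrum, larger magnitude first."""
    (a11, a12), (a21, a22) = a
    tr = a11 + a22
    det = a11 * a22 - a12 * a21
    disc = tr * tr - 4.0 * det
    if disc < 0.0:
        raise ValueError("complex eigenvalues")
    root = math.sqrt(disc)
    big = 0.5 * (tr - root) if tr < 0.0 else 0.5 * (tr + root)
    small = det / big        # avoids cancellation in the small eigenvalue
    return big, small


lam_fast, lam_slow = real_eigenvalues_2x2(jacobian(0.0, 0.0))
stiffness_ratio = abs(lam_fast) / abs(lam_slow)
```

At the initial point $u = v = 0$ this gives eigenvalues close to $-1012$ and $-.01$ and a stiffness ratio of about $10^5$.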

No theoretical solution is known for his system, so that its exact solution was 
computed using the classical Runge-Kutta fourth-order method (5.8-45). Since p(J) = 
1012 for x = 0, and since the absolute stability interval for this method is (0, 2.78), the step 
size h must be less than .00278 and indeed, a computation with h = .00S blew up at the 
second stage in the integration while h = .004 and h = 1/300 also gave incorrect results. 
For smaller values of h, we got the following values at x = 100: 


h u(100) v(100) 

0025 —.9916424297  .98333 67696 
.0020 —.991642)286 .98333 64258 
0010 — 99164 20711 98333 63603 
.0005 —.99164 20699  .98333 63589 


so that to eight figures, we have 
u(100) = —.99164207 v(100) = .98333636 


Note that the use of (5.8-45) required 400/h function evaluations, so that for h = .001 
the right-hand side of (5.10-11) was evaluated 400,000 times. 
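
The step-size restriction above can be illustrated on the linear test equation y' = λy with λ = −1012, the dominant eigenvalue at x = 0. For the classical fourth-order Runge-Kutta method, one step multiplies the solution by R(z) = 1 + z + z²/2 + z³/6 + z⁴/24 with z = hλ, and the computation is absolutely stable only while |R(z)| < 1. The Python sketch below uses hypothetical helper names and represents the nonlinear system only through its linearization at x = 0, which is an assumption. It evaluates |R| for the step sizes mentioned and counts the function evaluations needed to reach x = 100.

```python
def rk4_amplification(z):
    """Growth factor of one classical RK4 step applied to y' = lambda*y, with z = h*lambda."""
    return 1.0 + z + z**2 / 2.0 + z**3 / 6.0 + z**4 / 24.0

def function_evaluations(h, x_end=100.0, stages=4):
    """Right-hand-side evaluations needed to reach x_end with a 4-stage method."""
    return stages * round(x_end / h)

LAMBDA = -1012.0  # dominant eigenvalue of the Jacobian at x = 0 (from the text)

# |R(h*lambda)| for the step sizes tried in the example
growth = {h: abs(rk4_amplification(h * LAMBDA))
          for h in (0.005, 0.004, 1.0 / 300.0, 0.0025, 0.002, 0.001)}

for h, g in growth.items():
    status = "stable" if g < 1.0 else "unstable"
    print(f"h = {h:.6f}  |R(h*lambda)| = {g:.4f}  {status}")

print(function_evaluations(0.001))
```

On this linearized model the three largest step sizes give |R| > 1 and the three smallest give |R| < 1, which matches the behavior reported for the full system.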

The system (5.10-11) was also solved using the implicit methods of Gear of orders 3 
and 4. The equations for these methods, derived from (5.2-16), are, respectively, 


\[ y_{n+1} = \tfrac{1}{11}[18y_n - 9y_{n-1} + 2y_{n-2} + 6hf(x_{n+1}, y_{n+1})] \]

and

\[ y_{n+1} = \tfrac{1}{25}[48y_n - 36y_{n-1} + 16y_{n-2} - 3y_{n-3} + 12hf(x_{n+1}, y_{n+1})] \]

These equations were solved using a step length h = .1 and starting values obtained from 
the Runge-Kutta solution. The implicit equations were solved both by Newton’s method 
and by the modified Newton method (see Sec. 8.8). The details of the computation for the 
third-order case are as follows (those for the fourth-order case are similar). 

The implicit equations we had to solve had the form 


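\[
\begin{aligned}
c(u_{n+1}, v_{n+1}) &= u_{n+1} - \tfrac{18}{11}u_n + \tfrac{9}{11}u_{n-1} - \tfrac{2}{11}u_{n-2} - \bar{h}s(u_{n+1}, v_{n+1}) = 0 \\
d(u_{n+1}, v_{n+1}) &= v_{n+1} - \tfrac{18}{11}v_n + \tfrac{9}{11}v_{n-1} - \tfrac{2}{11}v_{n-2} - \bar{h}t(u_{n+1}, v_{n+1}) = 0
\end{aligned}
\tag{5.10-12}
\]

where h̄ = 6h/11, and where u_n, u_{n-1}, u_{n-2}, v_n, v_{n-1}, and v_{n-2} remain constant during the 
iterations for the solution of (5.10-12). 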
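The Jacobian of this system is 

\[
J(u, v) =
\begin{pmatrix}
1 - \bar{h}\,\partial s/\partial u & -\bar{h}\,\partial s/\partial v \\
-\bar{h}\,\partial t/\partial u & 1 - \bar{h}\,\partial t/\partial v
\end{pmatrix}
=
\begin{pmatrix}
1 + \bar{h}[g(u, v)(2u + 1001) + r(u)] & \bar{h}r(u) \\
\bar{h}(1 + v^2) & 1 + \bar{h}[g(u, v)(2v) + (1 + v^2)]
\end{pmatrix}
\]

since ∂s/∂u = −[g(u, v)(2u + 1001) + r(u)], ∂s/∂v = −r(u), ∂t/∂u = −(1 + v²), and 
∂t/∂v = −[g(u, v)(2v) + (1 + v²)]. Here g(u, v) = .01 + u + v, r(u) = 1 + (u + 1000)(u + 1), 
h̄ = 6h/11 as in (5.10-12), and 

\[
J^{-1} = \frac{1}{D}
\begin{pmatrix}
1 - \bar{h}\,\partial t/\partial v & \bar{h}\,\partial s/\partial v \\
\bar{h}\,\partial t/\partial u & 1 - \bar{h}\,\partial s/\partial u
\end{pmatrix}
\]

where 

\[
D = \left(1 - \bar{h}\frac{\partial s}{\partial u}\right)\left(1 - \bar{h}\frac{\partial t}{\partial v}\right) - \bar{h}^2\,\frac{\partial s}{\partial v}\frac{\partial t}{\partial u}
\]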

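As initial approximations to u_{n+1} and v_{n+1} we took u_{n+1}^{(0)} = 2u_n − u_{n-1}, 
v_{n+1}^{(0)} = 2v_n − v_{n-1}. The iteration using Newton’s method was 

\[
\begin{pmatrix} u_{n+1}^{(i+1)} \\ v_{n+1}^{(i+1)} \end{pmatrix}
=
\begin{pmatrix} u_{n+1}^{(i)} \\ v_{n+1}^{(i)} \end{pmatrix}
- J^{-1}(u_{n+1}^{(i)}, v_{n+1}^{(i)})
\begin{pmatrix} c(u_{n+1}^{(i)}, v_{n+1}^{(i)}) \\ d(u_{n+1}^{(i)}, v_{n+1}^{(i)}) \end{pmatrix}
\]

while in the modified Newton method, the same formula was used except that 
J^{-1}(u_{n+1}^{(i)}, v_{n+1}^{(i)}) was replaced by J^{-1}(u_{n+1}^{(0)}, v_{n+1}^{(0)}). 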

The results obtained with two iterations using Newton’s method and the modified 
Newton method were identical to more than eight figures and are as follows: 


Order      u(100)         v(100) 
  3      −.99164187     .98333613 
  4      −.99164208     .98333637 


We thus see that in this stiff system comparable accuracy was attained with much less 
work. 
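
A single step of the third-order formula with Newton's iteration, as described in this example, might be organized as in the following Python sketch. It is an illustration under stated assumptions rather than the program actually used. The names `gear3_step`, `f`, and `jac` are hypothetical, the system is restricted to two equations so that the 2 × 2 linear solve can be written out, and the demonstration uses a simple linear stiff test system (not (5.10-11)) with exact starting values so that the result is easy to check.

```python
import math

def gear3_step(f, jac, x_next, h, y_n, y_nm1, y_nm2, iters=2):
    """One step of the third-order implicit formula for a system of two equations.

    f(x, u, v) returns the pair of right-hand sides; jac(x, u, v) returns their
    2 x 2 matrix of partial derivatives. Newton's method is applied `iters` times.
    """
    hbar = 6.0 * h / 11.0
    # Known part of the formula: (18*y_n - 9*y_{n-1} + 2*y_{n-2}) / 11
    k_u, k_v = ((18.0 * a - 9.0 * b + 2.0 * c) / 11.0
                for a, b, c in zip(y_n, y_nm1, y_nm2))
    # Initial approximation by linear extrapolation: 2*y_n - y_{n-1}
    u, v = (2.0 * a - b for a, b in zip(y_n, y_nm1))
    for _ in range(iters):
        fu, fv = f(x_next, u, v)
        res_u = u - k_u - hbar * fu          # residual of the first implicit equation
        res_v = v - k_v - hbar * fv          # residual of the second implicit equation
        (a11, a12), (a21, a22) = jac(x_next, u, v)
        m11, m12 = 1.0 - hbar * a11, -hbar * a12
        m21, m22 = -hbar * a21, 1.0 - hbar * a22
        det = m11 * m22 - m12 * m21
        # Newton update: subtract (inverse of the 2 x 2 matrix) times the residual
        u, v = (u - (m22 * res_u - m12 * res_v) / det,
                v - (m11 * res_v - m21 * res_u) / det)
    return (u, v)

# Demonstration on an assumed linear stiff system: u' = -1000(u - 1), v' = -v
def f_lin(x, u, v):
    return (-1000.0 * (u - 1.0), -v)

def jac_lin(x, u, v):
    return ((-1000.0, 0.0), (0.0, -1.0))

def exact(x):
    return (1.0 - math.exp(-1000.0 * x), math.exp(-x))

h = 0.1
ys = [exact(0.0), exact(h), exact(2.0 * h)]   # exact starting values
for n in range(2, 10):                        # advance from x = 0.2 to x = 1.0
    ys.append(gear3_step(f_lin, jac_lin, (n + 1) * h, h, ys[n], ys[n - 1], ys[n - 2]))
u_end, v_end = ys[-1]
print(u_end, v_end, math.exp(-1.0))
```

To treat (5.10-11) itself, `f` would return (s(u, v), t(u, v)) and `jac` the matrix of their partial derivatives, with starting values taken from a Runge-Kutta computation as in the text. For the modified Newton method, the matrix would be formed once from the initial approximation and reused in every iteration.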
BIBLIOGRAPHIC NOTES 


An excellent, general reference on the numerical solution of ordinary differential equations is 
Henrici (1962a). A more theoretical treatment of the material in Henrici and of the more recent 
developments in the field is given by Stetter (1973). Other recent treatments are those of Gear 
(1971), Lambert (1973), and Lapidus and Seinfeld (1971), while the more classical references are 
Collatz (1960) and Milne (1953). Unconventional approaches to the subject are given by Miller 
(1966) and Daniel and Moore (1970). Fox (1962) surveys the field of numerical solution of 
ordinary differential equations as well as integral and partial differential equations. This is only 
one of many conference proceedings dealing principally with the numerical solution of ordinary 
differential equations. 


Section 5.1 No student should approach the numerical solution of ordinary differential 
equations without a firm grounding in basic existence theory. Henrici (1962a) contains an 
introduction to this subject. More extensive references are Coddington and Levinson (1955) 
and Ince (1926). Discussions of hybrid methods will be found in Lambert (1973) and Stetter 
(1973). 


Section 5.2 The method of undetermined coefficients, which is now widely used in the 
development of numerical integration formulas, was introduced by Dahlquist (1956). A more 
accessible reference is Hamming (1959). For other examples of its use, see Hull and Newbery 
(1959), Ralston (1961), and Crane and Lambert (1962). 

The Adams-Bashforth and Adams-Moulton formulas are classical and are discussed in all 
the references mentioned above. The implicit formulas based on numerical differentiation orig- 
inate with Gear; see Gear (1971). 


Section 5.3 The method of this section, particularly the use of the influence function, is 
due to Milne (1949). Other examples of its use can be found in Hamming (1973) and Hilde- 
brand (1974). Hildebrand also considers some more sophisticated methods of deriving error 
terms. Barrett (1952) considers some matters related to the material of this section. 


Section 5.4 It is possible to find almost as many definitions of stability as there are 
references on numerical integration methods. Relative stability is considered by Hamming 
(1959) and Hull and Newbery (1962) and is discussed in detail by Ralston (1965). Many authors 
define stability only for the case h = 0. Henrici (1962a, b) has excellent discussions of this case; 
see also Hildebrand (1974) and Lambert (1973). Much of the material of Sec. 5.4-2 is from Hull 
and Newbery (1961). 


Section 5.5 All the books on the numerical solution of ordinary differential equations 
mentioned above, as well as the book by Hamming (1973) and, in particular, the paper by 
Hamming (1959), are good references on predictor-corrector methods. The last is the source of 
most of Sec. 5.5-4. Other papers on the use of predictor-corrector methods are those of Hull 
and Newbery (1959, 1961), Ralston (1961), and Crane and Lambert (1962). A technique to 
remove the instability of Milne’s method is considered by Milne and Reynolds (1959, 1960). 


Section 5.6 For the use of Picard’s method in existence theory, see Ince (1926). A recent 
discussion of the use of Taylor-series methods is given by Barton et al. (1971). There is a good 
discussion of non-Runge-Kutta methods for starting the solution in Hildebrand (1974). 


Section 5.7 Other examples of the use of predictor-corrector methods are given by Milne 
(1953), Henrici (1962a), Hildebrand (1974), and Hamming (1973); see also Ralston (1961). 
Nordsieck (1962) discusses the numerical solution of ordinary differential equations in an 
overall computational sense. His approach has been extended and implemented in a very 
successful computer program, applicable also to stiff systems, by Gear (1971). 
The variable-order-variable-step method described here is based on that of Byrne and 
Hindmarsh (1975). Another approach is the subject of a book by Shampine and Gordon (1975), 
which gives a fully documented computer program implementing their method. 

Most of the generally accepted methods for the numerical solution of ordinary differential 
equations are compared in Hull et al. (1972). 


Section 5.8 Much of the discussion of Runge-Kutta methods is taken from Ralston 
(1962b). Kopal (1961), among many others, also has an extensive discussion of Runge-Kutta 
methods. The pair of Runge-Kutta formulas (5.8-47) and (5.8-48) is given by England (1969). 
Other such pairs for various orders m are given by Fehlberg (1969, 1970). A slightly different 
approach in the same spirit appears in Zonneveld (1970). 

The advantages and disadvantages of high-order Runge-Kutta methods are discussed by 
Curtis (1975). The step-size strategy in Sec. 5.8-7 is essentially that of Shampine and Allen 
(1973). Lambert (1973) is the source of Sec. 5.8-8 on stability of Runge-Kutta methods, includ- 
ing Example 5.7. 

An interesting generalization of the Runge-Kutta idea to integrate forward N steps at a 
time instead of a single step has been proposed by Rosser (1967). 


Section 5.9 Henrici (1962a), Hamming (1973), and Hildebrand (1974), among others, have 
discussions of methods based on higher derivatives. 

Extrapolation methods for ordinary differential equations were first suggested by Gragg, 
who developed the formulas (5.9-15); see Gragg (1965). A computer program implementing one 
of these methods is given by Fox (1971). Example 5.9 is from Dahlquist and Björck (1974), from 
whom the terms active and passive extrapolation are taken. 


Section 5.10 The definition of stiffness is that given in Lambert (1973), which is also the 
source of Example 5.10. A full description of Gear’s method appears in Gear (1971). A survey of 
numerical methods for stiff systems appears in Lapidus and Seinfeld (1971). A variable-order- 
variable-step method applicable to stiff systems is described in Byrne and Hindmarsh (1975). A 
comparison of various methods for solving stiff systems is given by Enright et al. (1975). 
PROBLEMS 


Section 5.1 


1 (a) Prove that a system of ordinary differential equations can be written in the form 
(5.1-1) if and only if the system can be rewritten with the highest-order derivative in each 
variable appearing as the left-hand side of one equation and nowhere else. 

(b) Write the following system in the form (5.1-1): 


d>y + dy dz , ay 

—_ + — — + xy?z —. = sin 

dxi dxdx. > dx? y 
d?z dy | ,d’y (* * dy | 
dx?dx > dx? a dx 


z 


e” 
Section 5.2 

2 (a) Derive a numerical integration method of the form (5.2-3) for n = 1, my = 0, and 
any m, by using a truncated Taylor series. 

(b) Explain how you would use the method of part (a) to find the solution of y’ = xy?, 
y(0) = 1. With m = 4 and h = .2, find an approximate solution at x = .2 and .4 and compare 
with the true solution. 

3 When the system (5.1-1) is linear, show that there is no computational difference be- 
tween using a forward-integration formula and an iterative formula. 


4 (a) Derive a fifth-order iterative formula of the form 

\[ y_{n+1} = ay_n + by_{n-1} + cy_{n-2} + dy_{n-3} + h(ey'_{n+1} + fy'_n + gy'_{n-1}) \]

Express each coefficient in terms of b. Find a value 0 < b < 1 that leads to a simple set of 
coefficients. 
(b) Repeat part (a) with the dy_{n-3} term replaced by a term dy'_{n-2}. [Ref.: Ralston (1961).] 
5 (a) For j = 0, 1, 2, 3 use the method of undetermined coefficients to derive fourth-order 
forward-integration methods of the form 

\[ y_{n+1} = a_jy_{n-j} + h(by'_n + cy'_{n-1} + dy'_{n-2} + ey'_{n-3}) \]


Show how each of these methods could have been derived by integrating the Lagrangian 
interpolation formula for y’(x). 

(b) Verify that Eqs. (5.5-29) are such that (5.5-28) will be fourth order. 

6 (a) Derive the Lagrangian form (5.2-5) of the Adams-Bashforth formula (5.2-13) for 
p = 0, 1, 2, 3. 

(b) Repeat part (a) for the Adams-Moulton formula (5.2-14). 

(c) Repeat part (a) for the implicit formula (5.2-16). 


Section 5.3 


7 Prove that if the influence function G(s) does change sign over the interval of integration 
[a, b], then there exists some continuous function y(s) for which 

\[ \int_a^b G(s)y(s)\,ds \ne y(\eta)\int_a^b G(s)\,ds \]

for any η in [a, b]. 
*8 (a) Use (5.3-12) to derive the error terms for the Newton-Cotes closed formulas with 
n= 1, 2 on an interval [0, nh] and with w(x) = 1. 
(b) Consider the quadrature method 

\[ \int_{-1}^{1} f(x)\,dx = f(\alpha) + f(-\alpha) + E \qquad 0 \le \alpha \le 1 \]

which is exact for 1 and x. For what values of α is the influence function of constant sign in 
[−1, 1]? Find E for those values of α. Explain the behavior at α = 1/√3. [Ref.: Hildebrand 
(1974), pp. 212-213.] 

*9 (a) By calculating G(s) at x_{n-3}, x_{n-2}, x_{n-1}, x_n, and x_{n+1}, surmise for which values of b 
in Prob. 4a the error can be expressed in the form (5.3-2). Why is this technique not sufficient to 
prove that the error can be expressed in this form? 

(b) For the values of b found in part (a), calculate the explicit form of the error term. 

*10 (a) For each value of j in Prob. 5a, use the influence function to determine whether the 
error can be expressed in the form (5.3-2), and if so, find the error. 

(b) Do both parts of Prob. 9 for the iterative formula (5.5-28), whose coefficients are given 
by (5.5-29). 
Section 5.4 


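11 (a) Suppose that in place of (5.4-1) we have the system 

\[
\begin{aligned}
(y^{(1)})' &= -K_{11}y^{(1)} - K_{12}y^{(2)} \qquad y^{(1)}(x_0) = y_{01} \\
(y^{(2)})' &= -K_{21}y^{(1)} - K_{22}y^{(2)} \qquad y^{(2)}(x_0) = y_{02}
\end{aligned}
\]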


Carry through an analysis similar to that in Eqs. (5.4-3) to (5.4-5) to show that the stability 
equation is now two coupled polynomial equations. Thus deduce that the analysis of Sec. 5.4 
leads to generally intractable algebraic problems when more than one equation is being 
considered. 

(b) However, show that the convergence of a numerical integration method is independent 
of whether a single equation or a system is being solved. 

12 (a) By substituting (5.4-7) into (5.4-5) and using (5.2-6), show that 8B, = —K. 

(b) Use Cramer’s rule to express c_i, i ≠ 0, in (5.4-4) in a form similar to (5.4-10), and use 
this result to show that if y_j → y_0, j = 1, ..., p, as h → 0, then c_i → 0 as h → 0 for i ≠ 0. 


13 (a) Prove Theorem 5.1 in the case where (5.4-14) has simple complex roots. Hint: If 
r = Re^{iθ} consider w_k = hR^k cos kθ and w_k² − w_{k+1}w_{k-1}. 

(b) Use a similar technique to prove Theorem 5.1 in the case of multiple roots. 

(c) Show that for the equation y’ = 0, y(0) = 0 the condition of Theorem 5.1 is also 
sufficient for convergence. [Ref.: Henrici (1962a), pp. 218-219.] 

14 (a) Show that the sequence (5.4-17) with A given by (5.4-18) satisfies (5.4-16). 

(b) Give an example of a numerical integration method which is convergent and satisfies 
only the first two equations of (5.2-6). 

*15 Suppose that |f(x, y)| ≤ K and |E_n| ≤ E for all n and all f(x, y) involved in 
calculating the solution of (5.1-1). Suppose further that |Khb_{-1}| < 1. 

(a) Show that the solution of the difference equation 

\[ e_{n+1}(1 + h|b_{-1}|K) = \sum_{i=0}^{p} (|a_i| - h|b_i|K)e_{n-i} + E \]

is such that 

\[ |\epsilon_i| \le e_i \qquad i = p+1, p+2, \ldots \]

where ε_i is the solution of (5.4-24), if the initial conditions for the difference equations are such 
that 

\[ |\epsilon_i| \le e_i \qquad i = 0, \ldots, p \]

Thus deduce that the solution of the above difference equation dominates that of (5.4-24). 
(b) Show that if all the a_i’s are positive, the difference equation of part (a) has a particular 
solution E/Khσ, where σ = Σ_{i=-1}^{p} |b_i|. 
(c) Show that if again all the a_i’s are positive, one solution of the homogeneous difference 
equation formed by setting E = 0 in part (a) is given by e_n = r_0^n, where 

\[ r_0 = 1 - \frac{Kh\sigma}{\sigma'} + O(h^2) \qquad \sigma' = \sum_{i=-1}^{p} b_i \]


(d) Suppose |e;| <4, i=0,..., p. Show that 


E 


= Ar" 
ro Kho 


e (ro — 1) 


is a solution of the difference equation of part (a) which is greater in magnitude than e, for all n, 
and thus deduce a bound for the propagated error. [Ref.: Hildebrand (1974), pp. 266-268, and 
Hull and Newbery (1961).] 
*16 (a) Show that the transformation 


maps the unit circle in the z plane into the left half of the w plane. 
(b) Show that the result of applying this transformation to the polynomial equation 


\[ c_0z^n + c_1z^{n-1} + \cdots + c_{n-1}z + c_n = 0 \]

is a polynomial equation 

\[ d_0w^n + d_1w^{n-1} + \cdots + d_{n-1}w + d_n = 0 \]
n 
where d,= Vcr sai 
i=0 


with the rf defined by 
(l+x)(l—xyt= ¥ rx 
=0 


(c) Deduce then that (5.4-5) has all its roots on or inside the unit circle if and only if the 
polynomial in w in part (b) has all its roots in the left half plane or on the imaginary axis where 
pti 
d, = » (—a;_, + hKb,_ rer par sy a_,=1 
i=0 
(d) The Hurwitz-Routh criterion states that the polynomial in w in part (b) has all its 
zeros in the left half plane or on the imaginary axis if and only if when d_0 > 0 (< 0), all the 
principal minors of the matrix D = [D_{ij}] are nonnegative (nonpositive), where 

\[ D_{ij} = d_{2i+1-j} \qquad i, j = 0, 1, \ldots, n-1; \quad d_k = 0 \ \text{if}\ k < 0 \ \text{or}\ k > n \]


Use this criterion and the results of parts (b) and (c) to find those values of b for which the 
method of Prob. 5b is convergent. 

(e) For b = 0, 4, find those values of Kh for which the roots of the stability equation 
(5.4-5) lie within the unit circle. Does this help determine when the method is stable? Why? 
[Ref.: Ralston (1962a).] 


*17 (a) A theorem of Schur states that the roots of the polynomial 

\[ c_0z^n + c_1z^{n-1} + \cdots + c_n = 0 \]

will be on or within the unit circle if and only if the quadratic form 

\[ \sum_{j=0}^{n-1} [(c_0x_j + c_1x_{j+1} + \cdots + c_{n-j-1}x_{n-1})^2 - (c_nx_j + c_{n-1}x_{j+1} + \cdots + c_{j+1}x_{n-1})^2] \]

is nonnegative definite. From this derive the result that a sufficient condition for a numerical 
integration method (5.2-5) to be convergent is that the matrix with elements 

\[ A_{rs} = \sum_{i=0}^{\min(r,s)} (c_{r-i}c_{s-i} - c_{p+1+i-r}c_{p+1+i-s}) \qquad r, s = 0, 1, \ldots, p \]

where c_j = −a_{j-1}, j = 0, 1, ..., p + 1, a_{-1} = −1 


be nonnegative definite. Why is this condition not necessary? 

(b) Use this criterion to repeat the calculation of part (d) of the previous problem. 

(c) If we wish to know for what values of Kh the roots of the polynomial equation related 
to (5.4-25) lie within the unit circle, what do we have to replace c,; by in part (a)? 
(d) Repeat Prob. 16e using this criterion. Why is this method more difficult to use than 
that of the previous problem? [But, see Emanuel (1963).] [Ref.: Wilf (1959).] 


*18 (a) For b = 4 determine for what values of Kh the stability polynomial for the method 
of Prob. 4a has all its roots on or inside the unit circle. 
(b) Show that, independent of the value of b, the iterative formula of Prob. 4b is never 
stable. [Ref.: Ralston (1961).] 


19 (a) Show that when K < 0, at least one root of the stability equation (5.4-5) always lies 
outside the unit circle for sufficiently small values of h. 

(b) Discuss the merits and limitations of a definition of stability which requires only that 
the roots of (5.4-5) lie within the unit circle when K > 0. 


Section 5.5 


20 (a) With b = 4 in the iterative formula of Prob. 4a, determine how large h may be for 
(5.5-10) to be satisfied for the differential equation y’ = —y, y(0) = 1. 

(b) For this value of h and b = 4, use the result of Prob. 9b to determine the truncation 
error of the method for the same differential equation. 

(c) Repeat parts (a) and (b) for the iterative formula of Prob. 5b with b = 0, 1 using the 
result of Prob. 10b. Show that when b = 1, this is Milne’s corrector and when b = 0, Hamming’s 
corrector. 

(d) Give an intuitive argument to justify the assertion that unless |hK| < 1, the trunca- 
tion error will not be very small. 

21 (a) Using the notation of (5.2-5), show that by integrating Newton’s backward formula 
we can obtain a class of predictors of order p + 1 with a_0 = 1, b_{-1} = 0, and a_i = 0, i ≠ 0. 

(b) Using the error term in Newton’s formula, derive the truncation errors for (5.5-13) to 
(5.5-15). 

22 (a) Show how Newton’s backward formula can be used to derive the class of correctors 
known as Adams-Moulton correctors 

\[ y_{n+1} = y_n + h\sum_{i=-1}^{p} b_iy'_{n-i} \]

by suitably varying the limits of integration from those of the previous problem. 
(b) Find an expression for the error term for this class of formulas. 
(c) Display the error terms for p = 0, 1, 2, 3. 


23 (a) Show how Newton’s backward formula can be used to generate Nystrom’s 
predictors 

\[ y_{n+1} = y_{n-1} + h\sum_{i=0}^{p} b_iy'_{n-i} \]

(b) Find an expression for the error term. Why does the problem of finding the error term 
differ from that in Probs. 21 and 22? 

(c) Display the formulas of this class for p = 0, 1, 2. For p = 0, 1 display the error term by 
making use of open Newton-Cotes quadrature formulas. 


24 (a) Show how the modified Hermite interpolation formula can be used to generate 
formulas of the form 

\[ y_{n+1} = \sum_{i=0}^{p} a_iy_{n-i} + h\sum_{i=s}^{q} b_iy'_{n-i} \qquad q \le p \]

which have an order of accuracy p + q + 1 − s. 
(b) In particular, derive the predictor P, of Table 5.1. 
(c) With p = 2, q = 2, s = 1, derive the predictor and compare its error-term coefficient 
with that of P,. 


25 Use the error terms of the predictor and corrector of (5.5-11) to derive an estimate of the 
truncation error in both predictor and corrector. Use the estimate of the predictor error to get a 
method analogous to (5.5-27). 


26 (a) Show that if the estimate of the corrector error in (5.5-27) were used to modify the 
corrector, the resulting method would be of fifth order. 

(b) Generalize this result to prove that if the estimate of the corrector error is used to 
modify the corrector, the resulting predictor-corrector method always has an order 1 greater 
than that of the predictor and corrector. 


*27 (a) If a term b_2y'_{n-2} is added to the corrector (5.5-28), derive the equations analogous 
to (5.5-29) if the corrector is still to be of order 4. Let a_1 and a_2 be the free parameters. 
(b) If this corrector is used to compute the solution of y’ = 0, y(0) = 0 but the initial 
conditions of the difference equation are y_{-2} = y_{-1} = 0, y_0 = ε (a roundoff error), show that 
the solution of the difference equation is 

\[ y_n = C_1r_1^n + C_2r_2^n + C_3 \qquad \text{where}\ C_3 = \frac{\epsilon}{1 + a_1 + 2a_2} \]

(c) Thus deduce that if the corrector is stable, the growth of this single roundoff error (in 
fact, of course, a new roundoff error will be introduced at each step) is determined by 
1 + a_1 + 2a_2. If 1 + a_1 + 2a_2 is positive (it usually is; see Table 5.2), show that there will be 
no magnification of the roundoff error if a_2 > (−a_1/2). Which methods in Table 5.2 satisfy this 
condition? 

(d) The above indicates that the accumulated roundoff error is correlated (not completely 
random) because of the relation between successive values of y, through the corrector. What 
factor should be kept small to control the random component of the error? Compare this 
quantity for the correctors of Table 5.2. 

(e) Assuming that the influence function G(s) for the corrector of part (a) is of constant 
sign over the necessary interval, find the form of the error term. 

(f) Use the result of Prob. 10b to deduce that, with b, = 0 and —.6 <a, < 1, G(s) is of 
constant sign, and thus calculate the last row in Table 5.2. [Ref.: Hamming (1973), pp. 369-371, 
395-401]. 

*28 Consider the corrector (5.3-8). 

(a) Use Prob. 16 to show that if the corrector is convergent, then —1 <a <1. 

(b) For these values of a show that if the roots of (5.4-5) are to lie within the unit circle 
when K > 0, then Kh < (6 — 6a)/(1 + a). 

(c) Find the value of a which maximizes the range of Kh in part (b) and is such that part 
(a) and (5.5-10) are satisfied. For this value of a, determine the other coefficients. This method 
has been called the maximally stable formula of this type of third order. (Note, however, that 
here stability is defined as in Prob. 19b.) [Ref.: Wilf (1960).] 


29 (a) Use the errors of the predictor and corrector of (5.5-33) to derive the modifier 
equation in (5.5-33). 

(b) Find the expression for the truncation error in (5.5-17). 

(c) Display the system of equations analogous to (5.5-27) and (5.5-33) for the corrector of 
Prob. 4a with b = 4, using as a predictor (5.5-17). Use the results of Prob. 9b. [Ref.: Hamming 
(1959) and Ralston (1961).] 

30 (a) Derive (5.5-1) and (5.5-2) using the method of undetermined coefficients. 

(b) Derive the same formulas by integrating the proper Lagrangian interpolation formula. 

(c) Can integration of the Lagrangian interpolation formula be used to derive all formulas 
of the form (5.2-5) with only one value of y; on the right-hand side? Why? 
*31 Obrechkoff’s method. Suppose we wish to find an equation for the numerical integra- 
tion of y’ = f(x, y) which has an error of the form 


h 


E= Gy | x"(x — hy Y"* (x) dx 


where h is the step size. 
(a) Integrate by parts r times to obtain 


1 
—___. (r+1) r _ r 
= Gal f Y (x) = a (h — xY] dx 
(b) Integrate by parts r additional times and translate the origin to x, to obtain the desired 
numerical integration formula 


(2r — (2r — j)!I hi 


ony 7 [yet 1 +(—1P"' yO] +E 


Inet Yet Gey DW 


where X,,, =X, +h. 
(c) Show that the error term can be written in the form 


h2r* 1 


E=(-ly Y2rt Wy 


2r+1 yi 


with x, <4 <xX,4,- [Ref.: Hildebrand (1974), pp. 284-285.] 


32 In order to avoid a calculation of f(x, y), it is possible to predict y’ directly. 

(a) Display an extrapolation formula which predicts y,,, uSiNg y,, Va—1) +++> Va-e- 

(b) What is the error in this formula? 

(c) Discuss the relative merits of this type of prediction and the prediction of y as dis- 
cussed in Sec. 5.5. Can an estimate of the error at each step be found when y’ is being 
predicted? [Ref.: Ralston (1961).] 


33 (a) Use the results of Prob. 1, Chap. 4, for k = 1 and n = 3 to derive the corrector of 
(5.5-11) by solving the second equation for f(a,), inserting the result in the first equation and 
making the appropriate changes in notation. 

(b) Derive formulas for y;,i = 0, 1, 2, 3, in terms of y,, i = 0, 1,2, 3. For each equation also 
derive the error term. 

(c) Eliminate y, and y, from the first three of the equations derived in part (b) to get 


h 
D (Syo + 8y, ~ y4) 


(d) Eliminate y, from the first two equations in part (b) to get 


Yi — Yor 


V2 = Syp — 4y, + 2h(yo + 2y') 


(e) Show how the equations of parts (c) and (d) together form a self-starting numerical 
integration method by using part (d) to “guess” the value of , required in part (c). Show that 
the error term of the overall method is O(h*). [Ref.: Wilf (1957).] 


Section 5.6 


34 (a) Show that starting values for a numerical integration method can be found by 
letting 


\[ y(x) = \sum_{k=0}^{\infty} A_k(x - x_0)^k \]

substituting this in (5.1-1), letting x − x_0 = hs, and getting a recurrence relation for the A_k’s, A_0 
being determined from the initial condition. For what types of functions f(x, y) is this method 
most applicable? 

(b) Show that this method gives results identical to the Taylor-series method of Sec. 5.6-1. 

35 (a) Complete the calculation of Example 5.6 to get values of y(.01s), s = 1, 2, 3 using 
terms through the third derivative. Compare these results with the true values. 

(b) Repeat the calculation of part (a) for h = .1. 

36 (a) Derive Eqs. (5.6-3). 

(b) Show that the truncation error in all the equations is O(h*). For which equation can a 
truncation error of the form (5.3-2) be calculated easily? Why? 

(c) Use these equations to find initial values for the differential equation y’ = y, y(0) = 1 
with h = .1. Use 1.1, 1.2, and 1.3 as initial guesses of y_1, y_2, and y_3, respectively. Carry eight 
decimal places and compare the answers with the true values. 


37 (a) If a Hermite interpolation formula is used to halve the interval in a numerical 
integration using the predictor-corrector method (5.5-33), how many points should be used in 
the interpolation? If y(x_n) has been computed using the interval h and y(x_n + ½h) is to be 
calculated next using (5.5-33), what interpolated values have to be computed? 

(b) If the interval is to be doubled, show that no restarting is necessary if enough past 
values have been saved. For the method (5.5-33), how many past values, that is, y,_ 1, y,- 2, etc., 
must be available for this method? 


38 For the differential equations 


y+y=0 y(0) = 1 (1) 

y + 2xy = 2x3 y(0) =0 (2) 
y+yt+xy’ =0 y(0) = 1 (3) 
y—2ytanx=2tanx y(0)=1 (4) 


(a) Use five terms of a Taylor series to obtain values of y at x = 0.1, 0.2, 0.3. 

(b) Solve (1) using Picard’s method. 

(c) Compute starting values for equation (1) at x = 0.1, 0.2, 0.3 using the method of 
Sec. 5.6-2. Use initial values y_1 = .9, y_2 = .8, y_3 = .7, and do three iterations. 

(d) Repeat part (a) using the Runge-Kutta method of (i) second order (5.8-38), (ii) third 
order (5.8-42), (iii) fourth order (5.8-46). 


Section 5.7 


39 (a) Use the data of Table 5.3 for x = .1, .2, .3, .4 to calculate y at x = .5 using 
Hamming’s and Milne’s methods (skipping the modifier step in both cases since this is the first 
predictor-corrector step). Estimate the error using the technique of Sec. 5.5-3 and compare 
with the actual error. 

(b) As in part (a), calculate y at x = .6, this time including the modifier step. 


40 (a) Compare the roundoff properties of Milne’s and Hamming’s methods (cf. 
Prob. 27). Would you expect the difference to be significant in the calculations of Table 5.6? 
(b) Show that roundoff error dominates in the calculation of Table 5.6 for large x. 

(c) Why should the interval be increased if possible when roundoff is the dominant error? 


41 For the four differential equations of Prob. 38, using the results of (iii) of Prob. 38d, 
continue the solution from x =0.3 to x = 1.0 with h =0.1, using (a) the trapezoidal rule 
method (5.5-11); (b) Euler’s method (5.5-13); (c) Adams’s method with p = 2(5.5-15); (d) Milne’s 
method (5.5-27); (e) Hamming’s method (5.5-33). In (a), (d), and (e) estimate the truncation 
error at each step. In all five cases compare the calculated and true values. How do you account 
for the uniformly poor results for equation (4)? 
Section 5.8 


*42 (a) Derive Eqs. (5.8-15) to (5.8-18). 

(b) If the w,’s, «,’s, and B,/s are to be independent of f(x, y), why must the expressions in 
brackets on the left-hand sides of (5.8-21) and (5.8-22) equal the corresponding operators on the 
right-hand sides? 

43 (a) Verify that for Runge-Kutta methods of order 2 and 3, m must be at least 2 and 3, 
respectively. 

(b) Prove that m > M for all orders by showing that the coefficient of h™ on the right- 
hand side of (5.8-1) always contains a term cwy(f,)”~ 7D, f, where c is a constant, and no other 
terms of this form in w,, j < M. 


44 (a) Starting from (5.8-16) to (5.8-18), derive the error bounds (5.8-32) to (5.8-34). 

(b) Show that «, = 4 minimizes y,. 

(c) Show that a, = 4, a, =? minimizes y,. [Ref.: Ralston (1962b).] 

45 (a) Verify (5.8-41). 

(b) Show that when a, = 0, there are no Runge-Kutta third-order methods. 

(c) Derive the one-parameter family of third-order Runge-Kutta methods when a, = a, 
and when a, = 0. [Ref.: Ralston (1962b).] 


46 (a) Use the results of Prob. 44 to verify that (5.8-38) is that second-order Runge-Kutta 
method for which the bound on jy, is minimized. 

(b) Verify that (5.8-42) is that third-order Runge-Kutta method for which the bound on y, 
is minimized. [Ref.: Ralston (1962b).] 


47 For fourth-order Runge-Kutta methods: 

(a) Verify (5.8-44). 

(b) Show that no solutions are possible when a, = 0 or «, = 1. Why is it reasonable to 
require «, #0 in any Runge-Kutta method? 

(c) Find the one-parameter family of solutions when a, = a3,%, = 1,a, =0.[Ref.: Kopal 
(1961), pp. 206-209.]. 

(d) Verify that (5.8-47) is of fourth order. 

48 (a) For what values of a, and a, other than the special cases of the previous problem 
can B3,., B42, Or Bg; be infinite? 

(b) What happens to the corresponding weights in these cases? 

(c) Even if the corresponding weight is zero and this term is dropped in (5.8-1), will the 
method be of fourth order? 


49 (a) Compare the errors in the second-order Runge-Kutta method (5.8-38) and the 
second-order corrector of (5.5-11) when f(x, y) = x? and when f(x, y) = y. Use (5.8-28) for the 
Runge-Kutta error. 

(b) Do the same for the fourth-order Runge-Kutta method (5.8-46) and the fourth-order 
corrector in (5.5-33) for f(x, y) = x* and y. 

(c) Why do the results of part (b) indicate that it is very difficult to assess the truncation 
error in a particular Runge-Kutta method using simple examples of this type? How does the 
highest-order term in (5.8-27) affect the comparative error estimates? Which of the two func- 
tions f(x, y) = y and f(x, y) = x™ is a more realistic test of the value of a method? 


*50 For a system of two simultaneous differential equations 

\[ \frac{dy}{dx} = f(x, y, z) \qquad \frac{dz}{dx} = g(x, y, z) \]

the Runge-Kutta equations corresponding to (5.8-1) are 

\[ y_{n+1} - y_n = \sum_{i=1}^{m} w_ik_i \qquad z_{n+1} - z_n = \sum_{i=1}^{m} v_im_i \]

where 

\[
\begin{aligned}
k_i &= h_nf\Bigl(x_n + \alpha_ih_n,\; y_n + \sum_{j=1}^{i-1}\beta_{ij}k_j,\; z_n + \sum_{j=1}^{i-1}\gamma_{ij}m_j\Bigr) \\
m_i &= h_ng\Bigl(x_n + \alpha_ih_n,\; y_n + \sum_{j=1}^{i-1}\beta_{ij}k_j,\; z_n + \sum_{j=1}^{i-1}\gamma_{ij}m_j\Bigr)
\end{aligned}
\]


(a) Use the analog of (5.8-14) to show that the coefficient of h', in k, and m, contains no 
term which includes a product of a 8,, and y,, for t < 4. 

(b) Deduce then that a Runge-Kutta method of order m < 4 for the above system is given by 
using w;, «;, B;; as for a single equation and by having v; = w,, Bi, = yi;- 

(c) Can this result be generalized for a system of N equations? Why? 

51 For the equation $d^2y/dx^2 = f(x, y, y')$ use the equations in the previous problem to 
write the equations for the fourth-order Runge-Kutta method corresponding to (5.8-45). How 
are these equations simplified when f is independent of y’? 

52 (a) Verify (5.8-52). 

(b) Determine the values of y, for the six-stage method of order 5 given by (5.8-48). 

(c) Verify that the interval of stability for a three-stage method of order 3 is (0, 2.51). 

(d) Repeat part (c) for m = 4 and (0, 2.78). 


Section 5.9 


53 (a) Verify (5.9-5) and (5.9-6) using (5.9-3) and (5.9-4). 
(b) Use the Hermite interpolation formula to derive (5.9-8) and (5.9-9). 
(c) Verify the modifier equation in (5.9-10) and find an estimate of T,. 


54 (a) Derive the equation analogous to (5.4-5) for Eq. (5.9-3). 
(b) Deduce that any method of the form (5.9-3) is stable for some range of values of Kh. 
(c) Deduce in particular that (5.9-5) is stable for all values of Kh. 


55 For the differential equation $y'' = y$, $y(0) = 1$, $y'(0) = -1$ use Hamming's method with 
h = 0.1 applied to two first-order equations to take the solution to x = 1.0. Use (5.8-46) to get 
starting values. 


56 Use (5.9-10) to calculate the solutions of the four differential equations in Prob. 38. 
Use h = .1 and the results of (iii) of Prob. 38d to carry the solution from x = 3 to x = 1.0. 
Compare the results with those of Prob. 41. Instead of starting the computation at x = .4, where 
could you have started it? 

57 Solve the differential equation of Example 5.9 using the sequence H/2, H/4, H/6, H/8, 
H/12,.... 

58 (a) Apply (5.9-13) and Richardson extrapolation to calculate the solutions of the four 
differential equations in Prob. 38 at x = 1 using the sequence H/2, H/4, H/8, .... 

(b) Repeat part (a) using the sequence of Prob. 57. 

59 Using passive extrapolation and the trapezoidal rule (5.9-15), solve the following three 
differential equations at the values x = 1(1)10, and compare with the values in Tables 5.3, 5.4 and 
5.6. 

$$y' = -y \qquad y(0) = 1 \tag{1}$$
$$y' = y \qquad y(0) = 1 \tag{2}$$
$$y' = \frac{1}{1 + \tan^2 y} \qquad y(0) = 0 \tag{3}$$

Use $h_i = H/2^i$. 
Section 5.10 


60 (a) Compute the solution of (5.10-1) at x = 3 using (5.9-13) and Richardson extrapo- 
lation with the sequence H/2, H/4, H/8, .... 

(b) Repeat part (a) using the sequence of Prob. 57. 

(c) Repeat part (a) using Hamming’s method. 

(d) Repeat part (a) using the Hermite method. 


61 Prove that the trapezoidal rule (5.10-10) is absolutely stable in the entire left-hand side 
of the complex plane. 


CHAPTER 


SIX 


FUNCTIONAL APPROXIMATION: 
LEAST-SQUARES TECHNIQUES 


6.1 INTRODUCTION 


Polynomial interpolation is a method of approximating the value of a func- 
tion at a point by means of a polynomial passing through known functional 
values. A major virtue of this method of approximation is its ease of 
implementation. Another virtue is that it leads to an expression for the 
truncation error in the approximation which can often be estimated or 
bounded. Implicit in our discussion of polynomial interpolation in Chap. 3 
was the assumption that truncation and not roundoff error was the major 
source of error. 

In this chapter we consider the problem of approximating a function 
whose values at a sequence of points are generally known only empirically 
and thus are subject to inherent errors which may be large. Thus roundoff 
will be a serious source of error and often the controlling source. Moreover, 
it is often the case that an approximation to such a function is desired which 
can be manipulated analytically—in particular differentiated—with a reason- 
able degree of accuracy. This, we saw in Chap. 4, is generally difficult with 
exact polynomial approximations. In fact, the $1/h^k$ factor in the numerical 
differentiation formula (4.1-15) means that if the inherent error in the func- 
tional value is large, differentiation will cause a serious “noise” problem. 
This is in contrast to the case where the functional value can be calculated to 
the full word length (say 10 decimals) of a digital computer, so that the 
roundoff error will be small. The subject of this chapter, least-squares 
approximations, is concerned with a technique by which noisy functional 
values can be used to generate a smooth approximation to the function. This 
smooth approximation can then, for example, be used to approximate the 
derivative of the function more accurately than exact approximations. 

The reader may be familiar with the principle of least squares as applied 
to continuous functions over an interval [a, b]. In this chapter we shall be 
concerned entirely with the principle of least squares as applied to functions 
known only at a discrete set of points. However, the derivation of the prin- 
ciple of least squares in the next section for a discrete set of points is precisely 
analogous to the derivation in the continuous case. 

Our reason for emphasizing the discrete rather than the continuous case 
is that this is the case of interest in numerical applications. Approximating 
continuous functions by least-squares techniques is, of course, of great the- 
oretical interest. In the case where the approximating functions are polyno- 
mials, such approximations of continuous functions lead naturally to the 
development of the orthogonal polynomials discussed in Chap. 4. As we 
shall see in Sec. 6.4, orthogonal polynomials also play an important role in 
discrete least-squares approximations. 


6.2 THE PRINCIPLE OF LEAST SQUARES 


We are now ready to make precise our heuristic definition of least-squares 
approximations in Sec. 2.1-2. Let f(x) be a function and $\{x_i\}$, i = 1, ..., n, be 
a sequence of data points at which we have observed values of f(x) which 
generally will be in error. We denote $f(x_i)$, the true value at $x_i$, by $f_i$, and we 
denote the observed value at $x_i$ by $\bar{f}_i$. We define $E_i = f_i - \bar{f}_i$. Throughout this 
chapter we shall assume that the errors at different data points are 
uncorrelated, i.e., independent. 

Let $\{\phi_j(x)\}$, j = 0, 1, ..., be a (generally finite) sequence of functions 
defined for every $x_i$. Then our object is to approximate $\bar{f}_i$ by a linear 
combination of the $\{\phi_j(x)\}$ 

$$\bar{f}_i \approx \sum_{j=0}^{m} a_j^{(m)} \phi_j(x_i) \qquad i = 1, \ldots, n \tag{6.2-1}$$

with the $a_j^{(m)}$ to be determined so that 

$$H(a_0^{(m)}, \ldots, a_m^{(m)}) = \sum_{i=1}^{n} w(x_i) \Bigl[\bar{f}_i - \sum_{j=0}^{m} a_j^{(m)} \phi_j(x_i)\Bigr]^2 = \sum_{i=1}^{n} w(x_i) R_i^2 \tag{6.2-2}$$

is minimized. The function w(x) is called the weight function and is assumed 
to be such that $w(x_i) > 0$, i = 1, ..., n. The quantity $R_i$ is called the residual at 
$x_i$. The superscript m on $a_j^{(m)}$ denotes the fact that the coefficient of $\phi_j(x)$ will 
generally depend on m, although in Sec. 6.4 we shall see that this is not 
always the case. Having determined the $a_j^{(m)}$ so as to minimize (6.2-2), we have 
then an approximation 

$$y_m(x) = \sum_{j=0}^{m} a_j^{(m)} \phi_j(x) \tag{6.2-3}$$


which is called a least-squares approximation of f(x) over $\{x_i\}$. We can use 
this approximation not only at the points $\{x_i\}$ but also at other values of x. In 
this sense, our development is analogous to that of Chap. 3, in which func- 
tional values at a discrete set of points were used to derive an approximation 
to be used over an interval. 

If $\phi_j(x) = x^j$ and there are no more data points than parameters in the 
approximating polynomial, that is, $n \le m + 1$, then by making the summation 
in (6.2-3) the Lagrangian interpolation polynomial corresponding to 
the points $\{x_i\}$, we would have $y_i = \bar{f}_i$ (the observed values), i = 1, ..., n, 
where $y_i = y_m(x_i)$. Since (6.2-2) would then be zero, this would be the desired 
minimum. In this chapter, however, we shall be concerned with the case 
$n > m + 1$; that is, we use a number of approximating functions less than the 
number of data points. Thus, for example, we might approximate a function 
known at five data points by a polynomial of degree 1, that is, m = 1. We could 
derive a polynomial of degree 4 passing through these five points, but such a 
fourth-degree polynomial will not enable us to smooth the empirical data. 
However, as we shall see, such smoothing is possible in general when 
$m + 1 < n$. 

Graphically, this is illustrated by Fig. 6.1. Suppose we have empirical 
data at five points on a function which is in fact linear as shown. The errors 
in the empirical data as such are much too great to allow approximating the 
true function by an exact linear approximation using any two points 
(although, if we knew which two points to choose, which we never shall, 
we could get quite a good approximation this way). If we pass a fourth-degree 
polynomial through the five points, we get an approximation whose 
deviation from the true function is not too bad but whose derivatives are 
much different from those of the true function. On the other hand, the linear 
least-squares approximation not only lies close to the true function but also 
has a similar slope. 

Figure 6.1 Least-squares and exact approximations. (Legend: empirical data 
points; true function; exact polynomial approximation; linear least-squares 
approximation; exact linear approximation.) 

Although the considerations above were for the case where the set 
$\{\phi_j(x)\}$ is the powers of x, the points we have made are equally true for other 
sets of functions, in particular for the functions {sin jx, cos jx}, which we 
shall also consider in this chapter. 

The use of Eq. (6.2-2) as the one to be minimized instead of 


$$\sum_{i=1}^{n} w(x_i)\,|R_i| \tag{6.2-4}$$

or 

$$\max_{i=1,\ldots,n} w(x_i)\,|R_i| \tag{6.2-5}$$


is motivated by an analytical consideration. If the magnitude of the error is 
the quantity in which we are directly interested, minimizing (6.2-4) would be 
more desirable than minimizing H. Minimizing (6.2-5) (the so-called mini- 
max approximation) has the advantage of giving us a sure bound on the 
error at any data point. The desirability in some circumstances of having 
such a sure bound on the error, not just at a discrete set of points but over a 
whole interval, is the motivation behind the whole of Chap. 7. But for gen- 
eral application with empirical data, the thing that rules out (6.2-4) and 
(6.2-5) is the fact that the absolute value is not a differentiable function of x. 
This makes the determination of the constants $a_j^{(m)}$ substantially more difficult 
in general using either (6.2-4) or (6.2-5) than using (6.2-2), although linear 
programming methods can be used in both these cases to determine the 
constants (see Sec. 9.10). 
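
The difference between the three criteria is easiest to see in the simplest possible case, a single constant approximating function ($m = 0$, $\phi_0(x) = 1$) with $w(x) = 1$. The following sketch, which uses hypothetical data chosen only for illustration, relies on the standard facts that the sum of squares is then minimized by the mean of the observations, the sum of magnitudes (6.2-4) by their median, and the largest magnitude (6.2-5) by the midpoint of their range.

```python
from statistics import mean, median

def criteria(fs, c):
    """Return the sum of squares (6.2-2), the sum of magnitudes (6.2-4) and the
    largest magnitude (6.2-5) of the residuals R_i = f_i - c, with w = 1."""
    r = [f - c for f in fs]
    return sum(v * v for v in r), sum(abs(v) for v in r), max(abs(v) for v in r)

fs = [1.0, 2.0, 2.5, 3.0, 9.0]       # hypothetical observations with one outlier
c_squares = mean(fs)                  # minimizes the sum of squares
c_absolute = median(fs)               # minimizes the sum of magnitudes
c_minimax = (min(fs) + max(fs)) / 2   # minimizes the largest magnitude

for name, c in [("least squares", c_squares),
                ("sum of magnitudes", c_absolute),
                ("minimax", c_minimax)]:
    print(name, c, criteria(fs, c))
```

The single outlier pulls the minimax constant furthest and the median not at all, with the least-squares constant in between.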

To calculate the $a_j^{(m)}$'s, we take the partial derivative of H in (6.2-2) with 
respect to $a_k^{(m)}$ and set it equal to 0, thereby obtaining 

$$\frac{\partial H}{\partial a_k^{(m)}} = -2 \sum_{i=1}^{n} w_i \Bigl[\bar{f}_i - \sum_{j=0}^{m} a_j^{(m)} \phi_j(x_i)\Bigr] \phi_k(x_i) = 0 \qquad k = 0, \ldots, m;\; w_i = w(x_i) \tag{6.2-6}$$

Equation (6.2-6) is a system of m + 1 linear equations for the m + 1 unknown 
$a_j^{(m)}$'s. This system is called the normal equations. If the determinant of 
the coefficients does not vanish, we can solve for the $a_j^{(m)}$'s. By considering 
$H(a_0^{(m)} + \Delta a_0, \ldots, a_m^{(m)} + \Delta a_m)$ it is not hard to show that this solution is 
indeed a minimum. 
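
As a concrete illustration, the normal equations (6.2-6) can be assembled and solved directly. The sketch below is only one way of doing this: the function name `normal_equations_fit`, the use of Gaussian elimination with partial pivoting, and the sample data are all assumptions made for the example rather than anything prescribed in the text.

```python
def normal_equations_fit(xs, fs, ws, basis):
    """Solve the normal equations (6.2-6) for the coefficients a_0, ..., a_m.

    xs, fs, ws are the data points, observed values and weights; basis is a
    list of callables phi_0, ..., phi_m."""
    size = len(basis)
    # Row k holds sum_i w_i phi_j(x_i) phi_k(x_i) for j = 0..m.
    A = [[sum(w * pj(x) * pk(x) for x, w in zip(xs, ws)) for pj in basis]
         for pk in basis]
    rhs = [sum(w * f * pk(x) for x, f, w in zip(xs, fs, ws)) for pk in basis]
    # Forward elimination with partial pivoting.
    for c in range(size):
        p = max(range(c, size), key=lambda r: abs(A[r][c]))
        if A[p][c] == 0:
            raise ValueError("normal equations are singular")
        A[c], A[p] = A[p], A[c]
        rhs[c], rhs[p] = rhs[p], rhs[c]
        for r in range(c + 1, size):
            t = A[r][c] / A[c][c]
            for k in range(c, size):
                A[r][k] -= t * A[c][k]
            rhs[r] -= t * rhs[c]
    # Back substitution.
    a = [0.0] * size
    for r in reversed(range(size)):
        s = sum(A[r][k] * a[k] for k in range(r + 1, size))
        a[r] = (rhs[r] - s) / A[r][r]
    return a

# Hypothetical data lying near a straight line; fit with phi_0 = 1, phi_1 = x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
fs = [1.1, 2.9, 5.2, 6.8, 9.0]
coeffs = normal_equations_fit(xs, fs, [1.0] * 5, [lambda x: 1.0, lambda x: x])
print(coeffs)
```

For these five points the two normal equations are $5a_0 + 10a_1 = 25$ and $10a_0 + 30a_1 = 69.7$, which can be checked by hand against the computed coefficients.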


Our basic assumption in this chapter is that for some unknown value of 
m, say M, the true function f(x) can be expressed as a finite linear combination 
of the set of functions $\{\phi_j(x)\}$; that is, we assume 

$$f(x) = \sum_{j=0}^{M} a_j^{(M)} \phi_j(x) \tag{6.2-7}$$

Now clearly this assumption will not always be satisfied in practice, but if 
the assumption is a good approximation to reality, the results we shall derive 
based on this assumption will be useful. 


6.3 POLYNOMIAL LEAST-SQUARES 
APPROXIMATIONS 


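In this and the next section, we consider the case in which $\phi_j(x)$ is a 
polynomial of degree j. In particular, in this section we shall consider the 
case $\phi_j(x) = x^j$ and $w(x) = 1$. For this case Eq. (6.2-6) becomes, after canceling 
the $-2$, 

$$\sum_{i=1}^{n} \Bigl(\bar{f}_i - \sum_{j=0}^{m} a_j^{(m)} x_i^j\Bigr) x_i^k = 0 \qquad k = 0, \ldots, m \tag{6.3-1}$$

Interchanging summations, we can rewrite (6.3-1) as 

$$\sum_{j=0}^{m} a_j^{(m)} \Bigl(\sum_{i=1}^{n} x_i^{j+k}\Bigr) = \sum_{i=1}^{n} \bar{f}_i x_i^k \qquad k = 0, \ldots, m \tag{6.3-2}$$

With the definitions 

$$g_{jk} = \sum_{i=1}^{n} x_i^{j+k} \qquad \rho_k = \sum_{i=1}^{n} \bar{f}_i x_i^k \tag{6.3-3}$$

this becomes 

$$\sum_{j=0}^{m} g_{jk} a_j^{(m)} = \rho_k \qquad k = 0, \ldots, m \tag{6.3-4}$$

Using matrix calculus, it can be proved that the least-squares problem, and 
thus the system (6.3-4), has a unique solution. We leave the proof to a 
problem {2}. 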


6.3-1 Solution of the Normal Equations 


We seem at this point to have solved the least-squares problem for the case 
$\phi_j(x) = x^j$, $w(x) = 1$. All we need do is perform the perhaps tedious calculations 
required to solve the normal equations (6.3-4). And indeed for small 
values of m, say up to 5 or 6, experience indicates that the solution of (6.3-4) 
produces quite good least-squares approximations. But for greater values of 
m, the solutions found by solving (6.3-4) generally lead to progressively 
poorer least-squares approximations. Moreover, this is quite independent of 
which of the many methods available for the solution of (6.3-4) (see Chap. 9) 
is used. An explanation of this can be found using the following argument. 
For convenience let us assume that the points $x_i$ are all in the interval 
(0, 1). Further, let us assume that they are distributed fairly uniformly in this 
interval. Then $g_{jk}$ as defined in (6.3-3) has the form of n times a Riemann 
sum. For large n, then, the approximation 

$$g_{jk} = \sum_{i=1}^{n} x_i^{j+k} \approx n \int_0^1 x^{j+k}\,dx = \frac{n}{j+k+1} \qquad j, k = 0, \ldots, m \tag{6.3-5}$$

should be a good one. Let $G = [g_{jk}]$ be the matrix of coefficients in (6.3-4). 
Using (6.3-5), we approximate G by n times the matrix H, where 

$$H = \begin{bmatrix} 1 & \frac{1}{2} & \frac{1}{3} & \cdots & \frac{1}{m+1} \\ \frac{1}{2} & \frac{1}{3} & \frac{1}{4} & \cdots & \frac{1}{m+2} \\ \frac{1}{3} & \cdots & & & \frac{1}{m+3} \\ \vdots & & & & \vdots \\ \frac{1}{m+1} & \cdots & & & \frac{1}{2m+1} \end{bmatrix} \tag{6.3-6}$$

This matrix is the principal minor of order m + 1 of the infinite Hilbert 
matrix. This matrix is a classical example of an ill-conditioned matrix (cf. 
Sec. 1-7). A matrix is ill-conditioned if when it has been normalized so that 
its largest element has order of magnitude 1, as, for example, in (6.3-6), its 
inverse has very large elements. Thus, for example, when m = 9, the inverse 
of (6.3-6) has elements of magnitude $3 \times 10^{12}$. The result of this is that any 
roundoff error incurred in entering the coefficients of $H_{m+1}$ into the computer 
(and such errors are inevitable because the elements of H generally have 
infinite binary or decimal expansions) will result in a matrix whose true 
inverse has greatly magnified errors. For quite small values of m, therefore, it 
becomes impossible to compute an accurate solution to a set of linear equations 
whose coefficient matrix is $H_{m+1}$. (It might be expected that roundoff 
errors introduced in the calculation of the inverse would make this problem 
much worse, but, interestingly enough, this is not so. The inverse of the 
matrix whose elements are rounded values of those in (6.3-6) can be calculated 
very accurately [see Wilkinson (1961)] although, of course, it still 
differs considerably from the true inverse of $H_{m+1}$.) 
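
The size of the elements of the inverse can be checked without any roundoff by working in exact rational arithmetic. The sketch below is an illustration only: the helper names `hilbert`, `inverse` and `largest_inverse_element` are assumptions of the example, and the Gauss-Jordan inversion is chosen for brevity, not efficiency.

```python
from fractions import Fraction

def hilbert(n):
    """The n-by-n matrix with elements 1/(i + j + 1), i, j = 0, ..., n - 1."""
    return [[Fraction(1, i + j + 1) for j in range(n)] for i in range(n)]

def inverse(a):
    """Gauss-Jordan inverse carried out in exact rational arithmetic."""
    n = len(a)
    aug = [row[:] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(a)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c] != 0)   # nonzero pivot
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(n):
            if r != c and aug[r][c] != 0:
                t = aug[r][c]
                aug[r] = [vr - t * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def largest_inverse_element(m):
    """Largest magnitude among the elements of the inverse of (6.3-6)."""
    inv = inverse(hilbert(m + 1))
    return max(abs(v) for row in inv for v in row)

for m in (1, 3, 5, 9):
    print(m, float(largest_inverse_element(m)))
```

For m = 9 the largest element comes out at about $3.5 \times 10^{12}$, in line with the magnitude quoted above.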

For m even as large as 9, the situation as we have presented it is so bad 
that even though G is only an approximation to a Hilbert matrix, we still 
expect to have a great deal of difficulty in solving the normal equations. 
Some actual examples to illustrate how hard it is to solve the system (6.3-4) 
with any degree of accuracy are considered in the problems {5}. 

The previous argument is a cogent one against using $\phi_j(x) = x^j$ for all 
but very small values of m. But to make the case even stronger, we now 
consider this class of functions from another standpoint. 


6.3-2 Choosing the Degree of the Polynomial 


Given a value of n, how is one to choose m, the degree of the polynomial 
approximation? This problem is analogous to choosing the order of an 
interpolation or quadrature formula as discussed in previous chapters. But 
whereas in those cases we were interested in making an error term small and 
at the same time being able to estimate it, here the considerations are different. 

Our basic hypothesis is that the true function f(x) is a polynomial of 
degree $M < n$ or at least can be accurately represented by such a polynomial. 
A priori we do not know what M is; our problem is to find it. If we 
choose a value of $m < M$, then clearly it is impossible to get a good representation 
of the true function. On the other hand, choosing a value of $m > M$ 
also defeats our purpose. We have pointed out that by choosing $m = n - 1$ 
we can make 

$$\delta_m^2 = \sum_{i=1}^{n} w_i R_i^2 = \sum_{i=1}^{n} w_i \Bigl(\bar{f}_i - \sum_{j=0}^{m} a_j^{(m)} x_i^j\Bigr)^2 \qquad w_i = w(x_i) \tag{6.3-7}$$

equal to 0. But in so doing, we shall have lost all smoothing properties of 
least-squares approximations. In fact, any value of $m > M$ sacrifices some 
smoothing. 
When we are using powers of x, (6.2-7) becomes 

$$f(x) = \sum_{j=0}^{M} a_j^{(M)} x^j \tag{6.3-8}$$

Therefore, if we knew M and calculated the least-squares approximation 

$$y_{M+1}(x) = \sum_{j=0}^{M+1} a_j^{(M+1)} x^j \tag{6.3-9}$$

using the observed data $\{\bar{f}_i\}$, then statistically $a_{M+1}^{(M+1)}$ should be 0. That is, if 
there were no errors in the data, it would be 0, but because of these errors, it 
will not be 0 even if the assumption that f(x) has the form (6.3-8) is correct. 
We should like then to test the statistical hypothesis that $a_{M+1}^{(M+1)} = 0$. In 
order to be able to do this, we make the one further assumption that the 
errors $E_i$ are normally distributed with zero mean and variance $\sigma^2/w_i$. This 
assumption is reasonable because more accurate measurements, i.e., those 
with small variance, will usually be more heavily weighted. 
This statistical hypothesis that we wish to test is often called the null 
hypothesis. It can be tested using maximum-likelihood statistical methods, a 
discussion of which is beyond the scope of this book [see Wilks (1962)]. Here 
we only state the result that if the null hypothesis is correct, then the 
expected value of 

$$\sigma_m^2 = \frac{\delta_m^2}{n - m - 1} \tag{6.3-10}$$

(with $\delta_m^2$ as in (6.3-7)) will be independent of m for m = M, M + 1, ..., n − 1. 
Thus in practice, since we do not know M, we would wish to solve the normal 
equations (6.3-4) for m = 1, 2, ..., compute $\sigma_m^2$, and continue as long as 
$\sigma_m^2$ decreases significantly with increasing m. As soon as a value of m is 
reached after which no significant decrease occurs in $\sigma_m^2$, this m is that of the 
null hypothesis and we have the desired least-squares approximation. In order 
to guard against the possibility that the underlying function is odd or even and 
that successive values of $\sigma_m^2$ will therefore be nearly equal before m = M, in 
practice we should stop the computation only after several $\sigma_m^2$ are almost the 
same. 
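
The selection rule can be sketched in a few lines. In the code below, `sigma_squared` applies (6.3-10) directly, while `choose_degree` is a deliberately crude stand-in for "no significant decrease": the threshold `ratio = 0.5` is an assumption of this example, not a value from the text, and the function stops at the first failure to decrease rather than waiting for several nearly equal values as recommended above. The sample values of $\delta_m^2$ are those quoted in (6.5-2) for the example of Sec. 6.5, where n = 9.

```python
def sigma_squared(delta_sq, n):
    """Apply (6.3-10): sigma_m^2 = delta_m^2 / (n - m - 1) for m = 0, 1, ..."""
    return [d / (n - m - 1) for m, d in enumerate(delta_sq)]

def choose_degree(sigma_sq, ratio=0.5):
    """Return the first m for which sigma_{m+1}^2 is not below ratio * sigma_m^2.

    The threshold `ratio` is an illustrative assumption; a careful program
    would also wait until several successive values are nearly equal."""
    for m in range(len(sigma_sq) - 1):
        if sigma_sq[m + 1] > ratio * sigma_sq[m]:
            return m
    return len(sigma_sq) - 1

# delta_m^2 for m = 0, ..., 5 as quoted in (6.5-2); n = 9 data points.
delta_sq = [26.202, 2.089, 0.0355, 0.000059, 0.00000049, 0.00000045]
sigma_sq = sigma_squared(delta_sq, 9)
print(sigma_sq)
print(choose_degree(sigma_sq))
```

With these inputs the computed $\sigma_m^2$ agree with (6.5-3) to the figures given there, and the crude rule selects m = 4.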

This means that we must compute the solution of the normal equations 
for a sequence of values of m.† Although some of the computations in the 
solution of these equations for m = r can be used to compute the solution for 
m = r + 1, there is nevertheless significant additional calculation at each 
stage [see p. 423]. This and the problems adduced in Sec. 6.3-1 constitute 
a strong case against using $\phi_j(x) = x^j$. In the next section we shall indicate 
how both the analytic problems of Sec. 6.3-1 and the computational 
problems discussed in this section can be avoided by the use of orthogonal 
polynomials. 

† In practice, using our knowledge of the problem, we would often start with a value of 
m > 1. 


6.4 ORTHOGONAL-POLYNOMIAL APPROXIMATIONS 


If $p_j(x)$ is a polynomial of degree j, the least-squares approximation of degree 
m can be written 

$$y_m(x) = \sum_{j=0}^{m} b_j^{(m)} p_j(x) \tag{6.4-1}$$

In order to minimize 

$$H(b_0^{(m)}, \ldots, b_m^{(m)}) = \sum_{i=1}^{n} w_i [\bar{f}_i - y_m(x_i)]^2 \tag{6.4-2}$$

we proceed as in the previous section. We get, corresponding to (6.3-4), 

$$\sum_{j=0}^{m} d_{jk} b_j^{(m)} = \omega_k \qquad k = 0, \ldots, m \tag{6.4-3}$$

where 

$$d_{jk} = \sum_{i=1}^{n} w_i p_j(x_i) p_k(x_i) \qquad \omega_k = \sum_{i=1}^{n} w_i \bar{f}_i p_k(x_i) \tag{6.4-4}$$

For arbitrary choice of the $\{p_j(x)\}$, the computational problems involved in 
solving these normal equations can be just as serious as before. If, however, 
the $\{p_j(x)\}$ are chosen so that the nondiagonal terms of the matrix $D = [d_{jk}]$ 
are small compared with the diagonal elements, then the matrix D, unlike G, 
will not be ill-conditioned. In particular, if the $\{p_j(x)\}$ are orthogonal over the 
set of points $\{x_i\}$, the off-diagonal terms will all be 0. By definition, a set of 
polynomials $\{p_j^{(n)}(x)\}$ is orthogonal over a set of points $\{x_i\}$ with respect to a 
weight function w(x) if 

$$\sum_{i=1}^{n} w_i\, p_j^{(n)}(x_i)\, p_k^{(n)}(x_i) = 0 \quad \text{if } j \ne k \qquad w_i = w(x_i) \tag{6.4-5}$$

where the superscript denotes the fact that the polynomial will depend on 
the number of points n. We assume in what follows that $w_i > 0$ for all i. If the 
$\{p_j^{(n)}(x)\}$ are orthogonal, then, as defined in (6.4-4), $d_{jk} = 0$, $j \ne k$. The system 
(6.4-3) then becomes 

$$d_{kk} b_k^{(m)} = \omega_k \qquad k = 0, \ldots, m \tag{6.4-6}$$

which has the immediate solution 

$$b_k^{(m)} = \frac{\omega_k}{d_{kk}} \qquad k = 0, \ldots, m \tag{6.4-7}$$

thereby eliminating the problems of solving an ill-conditioned system of 
normal equations. Moreover, the solution with m replaced by m + 1 is 
given by 

$$b_k^{(m+1)} = \frac{\omega_k}{d_{kk}} \qquad k = 0, \ldots, m + 1 \tag{6.4-8}$$

with $\omega_k$ and $d_{kk}$ again given by (6.4-4). Thus 

$$b_k^{(m)} = b_k^{(m+1)} \qquad k = 0, \ldots, m \tag{6.4-9}$$

Therefore, to compute the solution for m + 1, we need only compute $\omega_{m+1}$ 
and $d_{m+1,m+1}$. Since (6.4-9) indicates that $b_k$ is in fact independent of m, we 
shall from now on drop the superscript. 

The development above indicates that the use of orthogonal polynomials 
enables us to avoid the difficulties of both Secs. 6.3-1 and 6.3-2. We 
now proceed to consider the generation of polynomials orthogonal over 
discrete sets of points which need not be equally spaced. A convenient 
method for this is the Gram-Schmidt orthogonalization process. We begin 
with a set of m polynomials $q_j(x)$, j = 0, 1, ..., m − 1, where $q_j(x)$ is a 
polynomial of degree j, which are linearly independent over the set $\{x_i\}$. That 
is, there exist no constants $\{c_j\}$ other than $c_j = 0$, j = 0, ..., m − 1, such that 

$$\sum_{j=0}^{m-1} c_j q_j(x_i) = 0 \qquad i = 1, \ldots, n \tag{6.4-10}$$

A convenient choice for $\{q_j(x)\}$ is often 1, x, $x^2$, ..., $x^{m-1}$. In any case let 

$$p_0(x) = q_0(x)$$

$$p_j(x) = q_j(x) - \sum_{r=0}^{j-1} d_{rj}\, p_r(x) \qquad j = 1, \ldots, m - 1 \tag{6.4-11}$$

so that $p_j(x)$ is a polynomial of degree j. Suppose we have determined $p_0(x)$, 
$p_1(x)$, ..., $p_k(x)$ so that (6.4-5) is satisfied. Then, to determine $p_{k+1}(x)$ 
orthogonal to all $p_j(x)$, $j \le k$, we use (6.4-11) with j = k + 1 to write 

$$\sum_{i=1}^{n} w_i\, p_{k+1}(x_i)\, p_j(x_i) = \sum_{i=1}^{n} w_i\, q_{k+1}(x_i)\, p_j(x_i) - \sum_{r=0}^{k} d_{r,k+1} \sum_{i=1}^{n} w_i\, p_r(x_i)\, p_j(x_i) \qquad j = 0, 1, \ldots, k \tag{6.4-12}$$

We wish the left-hand side to be 0. By orthogonality all the terms in the 
double summation on the right-hand side are 0 except the term in r = j. 
Therefore 

$$d_{j,k+1} = \frac{\sum_{i=1}^{n} w_i\, q_{k+1}(x_i)\, p_j(x_i)}{\sum_{i=1}^{n} w_i\, [p_j(x_i)]^2} \qquad j = 0, 1, \ldots, k \tag{6.4-13}$$

In this way $p_{k+1}(x)$ is determined orthogonal to all $p_j(x)$ of lower degree, and 
continuing this process leads to a set of m orthogonal polynomials. The 
Gram-Schmidt process is also commonly used to generate sets of functions 
orthogonal over an interval or sets of orthogonal vectors {7}. 
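
The process is short enough to sketch directly. The code below is an illustration under stated assumptions: it takes $q_j(x) = x^j$, represents each polynomial only by its values at the data points, and uses exact rational arithmetic so that orthogonality can be checked exactly. The function names `inner` and `gram_schmidt` are introduced for the example. Each coefficient is the one obtained by setting the left-hand side of (6.4-12) to zero.

```python
from fractions import Fraction

def inner(u, v, ws):
    """Weighted discrete inner product sum_i w_i u_i v_i."""
    return sum(w * a * b for w, a, b in zip(ws, u, v))

def gram_schmidt(xs, ws, m):
    """Values at the points xs of p_0, ..., p_{m-1}, built from q_j(x) = x**j
    by subtracting from q_j its components along the earlier p_r, as in (6.4-11)."""
    ps = []
    for j in range(m):
        q = [x ** j for x in xs]
        p = q[:]
        for pr in ps:
            d = inner(q, pr, ws) / inner(pr, pr, ws)   # makes p orthogonal to pr
            p = [pv - d * prv for pv, prv in zip(p, pr)]
        ps.append(p)
    return ps

# Nine equally spaced points .1, .2, ..., .9 with unit weights.
xs = [Fraction(i, 10) for i in range(1, 10)]
ws = [Fraction(1)] * len(xs)
ps = gram_schmidt(xs, ws, 4)
print([inner(p, p, ws) for p in ps])
```

Because each $p_j$ is formed from all of the earlier polynomials, the work grows with j; the recurrence relation discussed next avoids this.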

A more convenient and efficient method for the derivation of orthogonal 
polynomials over discrete sets of points is the use of recurrence relations. 
Suppose that $\{p_j(x)\}$ is any sequence of polynomials satisfying the 
orthogonality relationship (6.4-5) with respect to some positive weight function 
w(x) and some sequence of data points $\{x_i\}$. We shall show by induction that 
there exists a relation of the form 

$$p_{j+1}(x) = (x - \alpha_{j+1})\, p_j(x) - \beta_j\, p_{j-1}(x) \qquad j = 0, 1, \ldots$$
$$p_0(x) = 1 \qquad p_{-1}(x) = 0 \tag{6.4-14}$$

where $\alpha_{j+1}$ and $\beta_j$ are constants to be determined. For j = 0 (6.4-14) 
becomes 

$$p_1(x) = x - \alpha_1 \tag{6.4-15}$$

The relation (6.4-5) requires that 

$$\sum_{i=1}^{n} w_i\, p_0(x_i)\, p_1(x_i) = \sum_{i=1}^{n} w_i (x_i - \alpha_1) = 0 \tag{6.4-16}$$

from which it follows that 

$$\alpha_1 = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \tag{6.4-17}$$

Let us suppose that for j = 0, 1, ..., k the polynomials $p_j(x)$ satisfy a 
relationship of the form (6.4-14) and the orthogonality relationship (6.4-5). 
Then we wish to show that we can choose $\alpha_{k+1}$ and $\beta_k$ so that with $p_{k+1}(x)$ 
defined by (6.4-14) 

$$\sum_{i=1}^{n} w_i\, p_j(x_i)\, p_{k+1}(x_i) = 0 \qquad j = 0, 1, \ldots, k \tag{6.4-18}$$

Substituting (6.4-14) with j = k into (6.4-18), we have 

$$\sum_{i=1}^{n} w_i x_i\, p_j(x_i)\, p_k(x_i) - \alpha_{k+1} \sum_{i=1}^{n} w_i\, p_j(x_i)\, p_k(x_i) - \beta_k \sum_{i=1}^{n} w_i\, p_j(x_i)\, p_{k-1}(x_i) = 0 \qquad j = 0, 1, \ldots, k \tag{6.4-19}$$

For j = 0, 1, ..., k − 2 the last two terms on the left-hand side of (6.4-19) are 
identically 0 by the induction hypothesis. Moreover, for these values of j, 
$x_i p_j(x_i)$ in the first term is a polynomial of degree no greater than k − 1 and 
thus can be expressed as a linear combination of the $p_j(x)$, j = 0, ..., k − 1. 
Therefore, again by the induction hypothesis, the first term is also 0. For 
j = 0, 1, ..., k − 2, then, (6.4-5) is satisfied for any choice of $\alpha_{k+1}$ and $\beta_k$. 

For j = k − 1 the second term is still 0, and so we get the requirement 
that 

$$\beta_k = \frac{\sum_{i=1}^{n} w_i x_i\, p_{k-1}(x_i)\, p_k(x_i)}{\sum_{i=1}^{n} w_i\, [p_{k-1}(x_i)]^2} = \frac{\sum_{i=1}^{n} w_i\, [p_k(x_i)]^2}{\sum_{i=1}^{n} w_i\, [p_{k-1}(x_i)]^2} \tag{6.4-20}$$

the second form following from use of (6.4-14). For j = k the third term 
vanishes, and we get 

$$\alpha_{k+1} = \frac{\sum_{i=1}^{n} w_i x_i\, [p_k(x_i)]^2}{\sum_{i=1}^{n} w_i\, [p_k(x_i)]^2} \tag{6.4-21}$$

Thus, we have the result that with $\beta_k$ and $\alpha_{k+1}$ given by (6.4-20) and (6.4-21), 
the polynomial of degree k + 1, $p_{k+1}(x)$, defined by (6.4-14), satisfies the 
orthogonality relation (6.4-5). This proves our assertion that a recurrence 
relation of the form (6.4-14) exists. We have assumed in the above that the 
denominators in (6.4-20) and (6.4-21) do not vanish. A denominator can 
vanish only if 

$$p_j(x_i) = 0 \qquad i = 1, \ldots, n \tag{6.4-22}$$

for some j. If (6.4-22) held for no j, we could generate an unending sequence 
of polynomials $p_j(x)$. Given n data points, we expect to be able to generate at 
most n independent polynomials $p_0(x)$, ..., $p_{n-1}(x)$. Therefore, it is no 
surprise that we can show {8} that if (6.4-14) is used to generate $p_n(x)$, then 
$p_n(x_i) = 0$, i = 1, ..., n. 

Using (6.4-14), (6.4-20), and (6.4-21), we can generate the least-squares 
approximation 

$$y_m(x) = \sum_{j=0}^{m} b_j\, p_j(x) \tag{6.4-23}$$

with 

$$b_j = \frac{\omega_j}{\gamma_j} \tag{6.4-24}$$

where 

$$\omega_j = \sum_{i=1}^{n} w_i \bar{f}_i\, p_j(x_i) \tag{6.4-25}$$

and 

$$\gamma_j = \sum_{i=1}^{n} w_i\, [p_j(x_i)]^2 \tag{6.4-26}$$

This technique of generating orthogonal polynomials is a very powerful 
one, and since it is easily mechanized, it is very useful on computers. In 
addition, it provides a convenient method of evaluating the approximation 
$y_m(x)$. It is not very hard to show {9} that the recurrence 

$$q_k(x) = b_k + (x - \alpha_{k+1})\, q_{k+1}(x) - \beta_{k+1}\, q_{k+2}(x) \qquad k = m, m - 1, \ldots, 0$$
$$q_{m+1}(x) = q_{m+2}(x) = 0 \tag{6.4-27}$$

is such that $q_0(x) = y_m(x)$. Similar recurrences can be used to evaluate the 
derivatives of $y_m(x)$ {9}. 
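
A minimal sketch of the backward recurrence (6.4-27) follows, together with a direct evaluation through the forward recurrence (6.4-14) for comparison. The storage convention is an assumption of this example: `alpha[j]` and `beta[j]` hold $\alpha_j$ and $\beta_j$, with `alpha[0]` unused and `beta[0] = 0`; entries beyond the ends of the lists are never needed with nonzero multipliers and are treated as zero. The coefficient values used in the check are hypothetical.

```python
def evaluate(x, b, alpha, beta):
    """Evaluate y_m(x) = sum_j b[j] p_j(x) by the backward recurrence (6.4-27)."""
    q1 = q2 = 0.0                      # q_{k+1} and q_{k+2}, both 0 at the start
    for k in range(len(b) - 1, -1, -1):
        a_next = alpha[k + 1] if k + 1 < len(alpha) else 0.0
        b_next = beta[k + 1] if k + 1 < len(beta) else 0.0
        q1, q2 = b[k] + (x - a_next) * q1 - b_next * q2, q1
    return q1                          # q_0(x)

def evaluate_direct(x, b, alpha, beta):
    """Form the same sum by generating p_j(x) from the forward recurrence (6.4-14)."""
    p_prev, p = 0.0, 1.0               # p_{-1}(x) and p_0(x)
    total = b[0]
    for j in range(len(b) - 1):
        beta_j = beta[j] if j < len(beta) else 0.0
        p_prev, p = p, (x - alpha[j + 1]) * p - beta_j * p_prev
        total += b[j + 1] * p
    return total

# Hypothetical coefficients for m = 3.
b = [1.0, 2.0, 3.0, -1.0]
alpha = [0.0, 0.5, 0.5, 0.5]
beta = [0.0, 0.1, 0.2]
for x in (-1.0, 0.0, 0.3, 2.0):
    print(x, evaluate(x, b, alpha, beta), evaluate_direct(x, b, alpha, beta))
```

The backward form needs only $\alpha_1, \ldots, \alpha_m$ and $\beta_1, \ldots, \beta_{m-1}$ and never forms the individual $p_j(x)$.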
When the data points are equally spaced and for the particular case 
$w(x) = 1$, the orthogonal polynomials are called Gram polynomials. In this 
case it is convenient to use an odd number of points 2L + 1 

$$x_s = x_0 + sh \qquad s = -L, \ldots, -1, 0, 1, \ldots, L \tag{6.4-28}$$

It can be shown {16} that $\alpha_{j+1} = 0$ for all j, so that the resulting recurrence 
relation is {16} 


] S , 
—— pj+ils, 2L) = — p\(s, 2L) — Bi; pj-1(s, 2L) j=), 1, wee 
j+l j €j-1 
Pols, 2L)=1 ~— p_,(s, 2L) =0 (6.4-29) 
2 2 -2 : 
_ FAL + te -s7] _ (2)! 1 


Before presenting an example of a least-squares orthogonal polynomial 
approximation in the next section, we develop and present here an algorithm 
for generating approximations of the form (6.4-23). Our technique is to use 
(6.4-14), (6.4-20), and (6.4-21) to generate the orthogonal polynomials or, 
more precisely, the values of these polynomials at the points x;. 


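Input 
    $\{x_i, w_i, \bar{f}_i\}$, i = 1, ..., n 
    m 


Algorithm 
    for i = 1, ..., n do $p_0(x_i) \leftarrow 1$; $p_{-1}(x_i) \leftarrow 0$; $y_m(x_i) \leftarrow 0$ endfor 
    $\gamma_0 \leftarrow \sum_{i=1}^{n} w_i$; $\beta_0 \leftarrow 0$ 
    for j = 0, ..., m do 
        $\omega_j \leftarrow \sum_{i=1}^{n} w_i \bar{f}_i\, p_j(x_i)$ 
        $b_j \leftarrow \omega_j/\gamma_j$ 
        for i = 1, ..., n do 
            $y_m(x_i) \leftarrow y_m(x_i) + b_j\, p_j(x_i)$ 
        endfor 
        if j = m then stop 
        $\alpha_{j+1} \leftarrow \sum_{i=1}^{n} w_i x_i [p_j(x_i)]^2/\gamma_j$ 
        for i = 1, ..., n do $p_{j+1}(x_i) \leftarrow (x_i - \alpha_{j+1})\, p_j(x_i) - \beta_j\, p_{j-1}(x_i)$ endfor 
        $\gamma_{j+1} \leftarrow \sum_{i=1}^{n} w_i [p_{j+1}(x_i)]^2$ 
        $\beta_{j+1} \leftarrow \gamma_{j+1}/\gamma_j$ 
    endfor 


Output 
    $b_j$, j = 0, ..., m 
    $\alpha_j$, j = 1, ..., m 
    $\beta_j$, j = 1, ..., m − 1 
    $y_m(x_i)$, i = 1, ..., n (smoothed values at the data points) 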


Note that the $\alpha_j$ and $\beta_j$ are listed as outputs in order to make it possible to 
use the recurrence relation (6.4-27) to compute $y_m(x)$ at points other than 
the given $x_i$. 
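
The algorithm above translates almost line for line into code. The following is a minimal sketch in floating-point arithmetic; the function name `orthogonal_fit` and the list layout of the results (index 0 of `alpha` unused, `beta[0]` holding $\beta_0 = 0$) are assumptions of this example. The data in the usage lines are exact values of a quartic chosen for illustration, so a fit with m = 4 should reproduce them to roundoff.

```python
def orthogonal_fit(xs, ws, fs, m):
    """Least-squares fit of degree m by orthogonal polynomials.

    Returns (b, alpha, beta, y): the coefficients b_0..b_m, the recurrence
    constants, and the smoothed values y_m(x_i) at the data points."""
    n = len(xs)
    p_prev = [0.0] * n                   # p_{-1}(x_i)
    p = [1.0] * n                        # p_0(x_i)
    y = [0.0] * n                        # y_m(x_i), accumulated term by term
    gamma = sum(ws)                      # gamma_0
    b, alpha, beta = [], [0.0], [0.0]    # alpha[0] unused; beta[0] = beta_0 = 0
    for j in range(m + 1):
        omega = sum(w * f * pj for w, f, pj in zip(ws, fs, p))
        b.append(omega / gamma)
        y = [yi + b[j] * pj for yi, pj in zip(y, p)]
        if j == m:
            break
        a = sum(w * x * pj * pj for w, x, pj in zip(ws, xs, p)) / gamma
        # Three-term recurrence for the values of p_{j+1} at the data points.
        p_prev, p = p, [(x - a) * pj - beta[j] * pp
                        for x, pj, pp in zip(xs, p, p_prev)]
        gamma_next = sum(w * pj * pj for w, pj in zip(ws, p))
        alpha.append(a)
        beta.append(gamma_next / gamma)
        gamma = gamma_next
    return b, alpha, beta, y

# Illustrative data: exact values of a quartic at x = .1, .2, ..., .9.
xs = [i / 10 for i in range(1, 10)]
fs = [x**4 + 3 * x**3 + 2 * x**2 + x + 5 for x in xs]
b, alpha, beta, y = orthogonal_fit(xs, [1.0] * 9, fs, 4)
print(b)
print(alpha[1:], beta[1:])
```

Only the values of $p_j$ and $p_{j-1}$ at the data points are kept at any stage, which is all the recurrence requires.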


6.5 AN EXAMPLE OF THE GENERATION OF 
LEAST-SQUARES APPROXIMATIONS 


Suppose we are given the empirical data (cf. Example 4.1) 


10.3627 


and we wish to find the best least-squares polynomial approximation to f (x) 
with weight function w(x) = 1. Since the data points are equally spaced, we 
could use the Gram polynomials. Instead, however, we shall use the algo- 
rithm at the end of the previous section. Using this algorithm with m = 5 we 
calculate: 


 j   $\gamma_j$                      $\omega_j$     $b_j$      $\alpha_j$   $\beta_j$ 
 0   9                               62.80767   6.97863 
 1   .6                              3.80372    6.33953    1/2    1/15 
 2   11,088/360,000                  2.05302    8.16435    1/2    77/1500 
 3   32,076/(225 × 10^5)             .00711     4.98754    1/2    81/1750 
 4   49,049/(765,625 × 10^3)         .00006     .99883     1/2    13/315 
 5   13/6,250,000                    .0000003   .12821     1/2 

and 

 $y_5(.1) = 5.1234$    $y_5(.2) = 5.3056$    $y_5(.3) = 5.5689$ 
 $y_5(.4) = 5.9374$    $y_5(.5) = 6.4374$    $y_5(.6) = 7.0976$ 
 $y_5(.7) = 7.9491$    $y_5(.8) = 9.0255$    $y_5(.9) = 10.3626$ 


Using the results computed with the algorithm we then, as in (6.3-7), 
calculate 

$$\delta_m^2 = \sum_{i=1}^{n} \Bigl[\bar{f}_i - \sum_{j=0}^{m} b_j\, p_j(x_i)\Bigr]^2 \tag{6.5-1}$$

We obtain 

$$\delta_0^2 = 26.202 \qquad \delta_1^2 = 2.089 \qquad \delta_2^2 = .0355$$
$$\delta_3^2 = .000059 \qquad \delta_4^2 = .00000049 \qquad \delta_5^2 = .00000045 \tag{6.5-2}$$

Then, using (6.3-10), we calculate 

$$\sigma_0^2 = 3.275 \qquad \sigma_1^2 = .298 \qquad \sigma_2^2 = .0059$$
$$\sigma_3^2 = .000012 \qquad \sigma_4^2 = .00000012 \qquad \sigma_5^2 = .00000015 \tag{6.5-3}$$

from which we conclude that m = 4 gives us the best least-squares 
approximation. This approximation is 

$$y_4(x) = \sum_{j=0}^{4} b_j\, p_j(x) \tag{6.5-4}$$

with the $b_j$'s given by (6.4-24) and $p_j(x)$ given by (6.4-14). Although there is 
normally no computational need to do so, we can convert (6.5-4) into an 
approximation using powers of x, in which case we get 

$$y_4(x) = .9988x^4 + 2.9898x^3 + 2.0172x^2 + .9920x + 5.0010 \tag{6.5-5}$$


In fact the values given in the table at the start of the section are 
perturbations of the values of 

$$f(x) = x^4 + 3x^3 + 2x^2 + x + 5 \tag{6.5-6}$$

The true values of f(x) at the points given in the table are 

 x       .1       .2       .3       .4       .5       .6       .7       .8       .9 
 f(x)    5.1231   5.3056   5.5691   5.9376   6.4375   7.0976   7.9491   9.0256   10.3631 


In order to show the effect of the ill condition of the matrix G, let us now 
repeat the above computation using powers of x instead of orthogonal 
polynomials. Using (6.3-3) and (6.3-4), we wish to calculate $a_j^{(4)}$, j = 0, ..., 4. 
We get for G, the matrix of the $g_{jk}$, 

$$G = \begin{bmatrix} 9.0 & 4.5 & 2.85 & 2.025 & 1.5333 \\ 4.5 & 2.85 & 2.025 & 1.5333 & 1.20825 \\ 2.85 & 2.025 & 1.5333 & 1.20825 & .978405 \\ 2.025 & 1.5333 & 1.20825 & .978405 & .8080425 \\ 1.5333 & 1.20825 & .978405 & .8080425 & .67731333 \end{bmatrix}$$

$$= 9 \begin{bmatrix} 1.00000 & .50000 & .31667 & .22500 & .17037 \\ .50000 & .31667 & .22500 & .17037 & .13425 \\ .31667 & .22500 & .17037 & .13425 & .10871 \\ .22500 & .17037 & .13425 & .10871 & .08978 \\ .17037 & .13425 & .10871 & .08978 & .07526 \end{bmatrix} \tag{6.5-7}$$

From (6.5-7) we see that $\frac{1}{9}G$ is quite close to (6.3-6), as we would expect since 
the points $x_i$ are all in the interval (0, 1) and are equally spaced. For the 
determinant of G we get 

$$\det G = .000000014 \tag{6.5-8}$$


Therefore, we expect that the errors incurred in the calculation of G will 
cause substantial errors in the solution of the normal equations. We empha- 
size that the orthogonal-polynomial and powers-of-x formulations are two 
ways of stating precisely the same problem. Therefore, any difference be- 
tween (6.5-5) and the solution of (6.3-4) will be entirely due to different 
computational techniques. 
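
The matrix G and its determinant can be reproduced exactly, since the $g_{jk}$ of (6.3-3) depend only on the data points. The sketch below assumes the nine points are x = .1, .2, ..., .9 and uses rational arithmetic; the helper names `moment_matrix` and `determinant` are introduced for the example.

```python
from fractions import Fraction

def moment_matrix(xs, m):
    """G = [g_jk] with g_jk = sum_i x_i**(j + k), as in (6.3-3)."""
    return [[sum(x ** (j + k) for x in xs) for k in range(m + 1)]
            for j in range(m + 1)]

def determinant(a):
    """Exact determinant by Gaussian elimination on rational entries."""
    a = [row[:] for row in a]
    det = Fraction(1)
    for c in range(len(a)):
        p = next((r for r in range(c, len(a)) if a[r][c] != 0), None)
        if p is None:
            return Fraction(0)
        if p != c:
            a[c], a[p] = a[p], a[c]
            det = -det                  # a row interchange changes the sign
        det *= a[c][c]
        for r in range(c + 1, len(a)):
            t = a[r][c] / a[c][c]
            a[r] = [vr - t * vc for vr, vc in zip(a[r], a[c])]
    return det

xs = [Fraction(i, 10) for i in range(1, 10)]      # assumed data points .1, ..., .9
G = moment_matrix(xs, 4)
scaled = [[float(g / 9) for g in row] for row in G]   # compare with 1/(j + k + 1)
det_G = determinant(G)
print([float(g) for g in G[0]])
print(scaled[0])
print(float(det_G))
```

The first row of `scaled` can be set beside 1, 1/2, 1/3, 1/4, 1/5 to see how closely G/9 follows the Hilbert matrix, and the determinant agrees with (6.5-8) to the two significant figures given there.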

When (6.3-3) is used, the right-hand side of (6.3-4) is 

$$\rho_0 = 62.8077 \qquad \rho_1 = 35.20757 \qquad \rho_2 = 23.944287$$
$$\rho_3 = 17.8176647 \qquad \rho_4 = 13.93266027 \tag{6.5-9}$$

If we solve the normal equations using Gaussian elimination (see Sec. 9.3-1) 
and carry six decimal places throughout the computation, we get, rounding 
the results to four decimal places, 

$$a_4^{(4)} = .9672 \qquad a_3^{(4)} = 3.0522 \qquad a_2^{(4)} = 1.9763$$
$$a_1^{(4)} = 1.0020 \qquad a_0^{(4)} = 5.0003 \tag{6.5-10}$$

which in the case of $a_4^{(4)}$ and $a_3^{(4)}$ have errors far larger than those in (6.5-5). 


A possible source of loss of significance when using orthogonal polynomials 
is in the computation of the $b_j$'s. If, for example, the magnitude of $p_j(x_i)$ 
is small for all $x_i$, the calculation of the quotient $\omega_j/\gamma_j$ may result in a 
substantial loss of significance, particularly in fixed-point calculations {19}. 
This is not a problem when using the Gram polynomials with the normalized 
variable s, but it can be a serious problem when using the recurrence-relation 
technique. To avoid loss of significance in this latter case, it is desirable to 
scale and shift the data points from their original interval to a more 
convenient one; i.e., in effect to normalize the independent variable. Such a 
convenient interval [Forsythe (1957)] when $w(x) = 1$ is [−2, 2]. 

It is important that the reader clearly distinguish the two types of errors 
considered here. On the one hand, the ill condition of G causes the difference 
between the calculated coefficients given by (6.5-5) and (6.5-10). On the 
other hand, the difference between the coefficients in (6.5-5) and the true 
coefficients (6.5-6) is due to the inherent empirical errors in the data. 


6.6 THE FOURIER APPROXIMATION 


Particularly when the data are from a real-time application, there may be 
physical knowledge of the function f(x) which indicates that it is periodic. In 
this case it is advantageous to use the trigonometric (or Fourier) functions 
instead of polynomials as the least-squares approximating functions. In 
Sec. 6.6-2 we consider least-squares approximations based on the Fourier 
functions, but because the summations which arise in such approximations 
play a role in many applications besides least-squares approximations, we 
consider first, in Sec. 6.6-1, the evaluation of such sums by the algorithm 
known as the fast Fourier transform. 


6.6-1 The Fast Fourier Transform 


Let $g_k$, $k = 0, 1, \ldots, N - 1$, be a set of complex numbers and let

$$
G_j = \sum_{k=0}^{N-1} g_k e^{2\pi i jk/N} = \sum_{k=0}^{N-1} g_k w^{jk} \qquad j = 0, 1, \ldots, N - 1 \tag{6.6-1}
$$

where $w = e^{2\pi i/N}$.


Equation (6.6-1) is often called the discrete Fourier transform (DFT) of the
sequence $\{g_k\}$ by analogy with the (continuous) Fourier transform

$$G(x) = \int_{-\infty}^{\infty} g(t)\, e^{2\pi i x t}\, dt \tag{6.6-2}$$

Indeed, there is a direct relationship between the discrete and continuous
transforms, whose derivation we leave to a problem {21}. In the same way
that the continuous Fourier transform can be inverted, so can the discrete
transform, to yield

$$g_k = \frac{1}{N} \sum_{j=0}^{N-1} G_j w^{-jk} \qquad k = 0, 1, \ldots, N - 1 \tag{6.6-3}$$

Using the orthogonality relationship {22}

$$
\sum_{k=0}^{N-1} w^{jk} w^{-rk} =
\begin{cases}
N & \text{if } j \equiv r \pmod N \\
0 & \text{otherwise}
\end{cases} \tag{6.6-4}
$$

it is not hard to show that if $G_j$ given by (6.6-1) is substituted into (6.6-3), the
right-hand side gives back $g_k$ {22}. The $G_j$ and $g_k$ thus form a transform pair,
and it is convenient to use the notation

$$G_j \leftrightarrow g_k$$

to denote this. Moreover, since both (6.6-1) and (6.6-3) are periodic with
period N, we may consider $G_j$ and $g_k$ to be defined for all j and k with
$G_{j+N} = G_j$ and $g_{k+N} = g_k$.

Among the properties of the DFT we state the following, leaving the 
proofs to a problem {22}. 


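Property 1 Linearity If $G_j \leftrightarrow g_k$ and $H_j \leftrightarrow h_k$ and $\alpha$ and $\beta$ are any
complex constants, then $\alpha G_j + \beta H_j \leftrightarrow \alpha g_k + \beta h_k$.


Property 2 Shifting If $G_j \leftrightarrow g_k$, then for any integer n

$$w^{jn} G_j \leftrightarrow g_{k-n} \qquad \text{and} \qquad G_{j-n} \leftrightarrow w^{-kn} g_k$$


Property 3 Convolution If $G_j \leftrightarrow g_k$ and $H_j \leftrightarrow h_k$, then

$$\frac{1}{N} \sum_{r=0}^{N-1} G_r H_{j-r} \leftrightarrow g_k h_k \qquad \text{and} \qquad G_j H_j \leftrightarrow \sum_{r=0}^{N-1} g_r h_{k-r}$$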


In addition to these properties, there are many other results about the DFT, 
some of which are considered in the problems {21 to 24}. Our main interest 
here, however, is the calculation of the DFT, which we now consider. 
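
Before turning to fast algorithms, it may help to see the definitions evaluated directly. The following sketch is illustrative only (the function names and the sample data are hypothetical): it evaluates (6.6-1) and (6.6-3) term by term with w = exp(2πi/N), then checks numerically that the pair inverts and that the transform of a cyclic convolution is the product of the transforms.

```python
import cmath

def dft(g):
    """Direct evaluation of (6.6-1): G_j = sum_k g_k w^(jk), w = exp(2*pi*i/N)."""
    n = len(g)
    return [sum(g[k] * cmath.exp(2j * cmath.pi * ((j * k) % n) / n) for k in range(n))
            for j in range(n)]

def inverse_dft(G):
    """Direct evaluation of (6.6-3): g_k = (1/N) sum_j G_j w^(-jk)."""
    n = len(G)
    return [sum(G[j] * cmath.exp(-2j * cmath.pi * ((j * k) % n) / n) for j in range(n)) / n
            for k in range(n)]

def cyclic_convolution(g, h):
    """sum_r g_r h_(k-r), with indices taken mod N (periodic extension)."""
    n = len(g)
    return [sum(g[r] * h[(k - r) % n] for r in range(n)) for k in range(n)]

def max_abs_diff(a, b):
    """Largest componentwise difference between two sequences."""
    return max(abs(x - y) for x, y in zip(a, b))

g = [1, 2 - 1j, 0.5j, -3, 4, 1 + 1j]   # arbitrary illustrative data
h = [2, 0, -1j, 1, 0.25, -2]

G, H = dft(g), dft(h)
roundtrip_error = max_abs_diff(inverse_dft(G), g)
convolution_error = max_abs_diff(dft(cyclic_convolution(g, h)),
                                 [a * b for a, b in zip(G, H)])
print(roundtrip_error, convolution_error)   # both at roundoff level
```

Each call to `dft` costs on the order of N² complex multiplications, which is the count the rest of the section sets out to reduce.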

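We begin by noting that, given the $g_k$'s, calculation of the $G_j$ using (6.6-1)
as given would require N complex multiplications and additions for each j,
or $N^2$ for all the $G_j$'s. Now suppose N can be factored into

$$N = r_1 r_2 \cdots r_t \tag{6.6-5}$$

Corresponding to the indices j and k we define t-tuples $(j_1, \ldots, j_t)$ and
$(k_1, \ldots, k_t)$ such that

$$
\begin{aligned}
j &= j_1 + r_1 j_2 + r_1 r_2 j_3 + \cdots + r_1 r_2 \cdots r_{t-1} j_t & \qquad j_s &= 0, 1, \ldots, r_s - 1 \\
k &= k_t + r_t k_{t-1} + r_t r_{t-1} k_{t-2} + \cdots + r_t r_{t-1} \cdots r_2 k_1 & \qquad k_s &= 0, 1, \ldots, r_s - 1
\end{aligned}
\qquad s = 1, \ldots, t \tag{6.6-6}
$$
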
The integers $j_1, \ldots, j_t$ and $k_1, \ldots, k_t$ are called the digits of j and k,
respectively. For example, if N = 30, then

$$N = 2 \cdot 3 \cdot 5 \qquad j = j_1 + 2j_2 + 6j_3 \qquad k = k_3 + 5k_2 + 15k_1 \tag{6.6-7}$$

If j = 23, then $j_1 = 1$, $j_2 = 2$, $j_3 = 3$, and if k = 8, then $k_1 = 0$, $k_2 = 1$, $k_3 = 3$.
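
The two digit expansions in (6.6-6) use the factors in opposite orders, which is easy to get wrong. A minimal sketch of the digit extraction follows; the function names are hypothetical, and the factorization 30 = 2 · 3 · 5 is the one implied by the digits quoted in the example.

```python
def digits_of_j(j, radices):
    """Digits (j_1, ..., j_t) with j = j_1 + r_1 j_2 + r_1 r_2 j_3 + ..."""
    digits = []
    for r in radices:              # r_1 first: j_1 is the least significant digit
        j, d = divmod(j, r)
        digits.append(d)
    return tuple(digits)

def digits_of_k(k, radices):
    """Digits (k_1, ..., k_t) with k = k_t + r_t k_(t-1) + r_t r_(t-1) k_(t-2) + ..."""
    digits = []
    for r in reversed(radices):    # r_t first: k_t is the least significant digit
        k, d = divmod(k, r)
        digits.append(d)
    return tuple(reversed(digits))

radices = (2, 3, 5)                # r_1, r_2, r_3 for N = 30
print(digits_of_j(23, radices))    # (1, 2, 3)
print(digits_of_k(8, radices))     # (0, 1, 3)
```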

Now we shall develop the fast Fourier transform (FFT) algorithm for the 
case t = 3 in order to keep the algebra relatively simple. The generalization 
for an arbitrary value of t will be fairly obvious {25}. Using (6.6-6), we write 


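$$w^{jk} = w^{j(k_3 + r_3 k_2 + r_3 r_2 k_1)} \tag{6.6-8}$$

and substitute into (6.6-1) to obtain

$$G_j = \sum_{k_3=0}^{r_3-1} \sum_{k_2=0}^{r_2-1} \sum_{k_1=0}^{r_1-1} g_k\, w^{j r_3 r_2 k_1}\, w^{j r_3 k_2}\, w^{j k_3} \tag{6.6-9}$$

Now using (6.6-6), we have

$$w^{j r_3 r_2} = w^{(j_1 + r_1 j_2 + r_1 r_2 j_3)(r_3 r_2)} = w^{j_1 r_3 r_2} \tag{6.6-10}$$

since all other terms in the exponent contain $r_1 r_2 r_3 = N$ and $w^{\alpha N} = 1$, where $\alpha$ is
any integer. Similarly

$$w^{j r_3} = w^{(j_1 + r_1 j_2) r_3} \tag{6.6-11}$$

Substituting (6.6-10) and (6.6-11) into (6.6-9), we obtain

$$
G_j = \sum_{k_3=0}^{r_3-1} \left\{ \sum_{k_2=0}^{r_2-1} \left[ \sum_{k_1=0}^{r_1-1} g_k\, w^{j_1 r_3 r_2 k_1} \right] w^{(j_1 + r_1 j_2) r_3 k_2} \right\} w^{(j_1 + r_1 j_2 + r_1 r_2 j_3) k_3} \tag{6.6-12}
$$

Noting that the term in square brackets depends only on $j_1$, $k_2$, and $k_3$ and the
term in braces only on $j_1$, $j_2$, and $k_3$, we describe the FFT algorithm as follows: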


Input
$g_k$: N values stored in increasing order of the index k, that is, from
$(k_1, k_2, k_3) = (0, 0, 0)$ to $(r_1 - 1, r_2 - 1, r_3 - 1)$


Algorithm

$$
\begin{aligned}
f_0(k_1, k_2, k_3) &\leftarrow g_k \\
f_1(j_1, k_2, k_3) &\leftarrow \sum_{k_1=0}^{r_1-1} f_0(k_1, k_2, k_3)\, w^{j_1 r_3 r_2 k_1} \\
f_2(j_1, j_2, k_3) &\leftarrow \sum_{k_2=0}^{r_2-1} f_1(j_1, k_2, k_3)\, w^{(j_1 + r_1 j_2) r_3 k_2} \\
f_3(j_1, j_2, j_3) &\leftarrow \sum_{k_3=0}^{r_3-1} f_2(j_1, j_2, k_3)\, w^{(j_1 + r_1 j_2 + r_1 r_2 j_3) k_3}
\end{aligned} \tag{6.6-13}
$$

Output

$$G_j = f_3(j_1, j_2, j_3)$$


From (6.6-12) it follows that this algorithm does indeed compute the $G_j$.
Moreover, the number of complex multiplications and additions required in
(6.6-13) for each value of the argument triple on the left-hand side is

$$r_1 + r_2 + r_3$$

Since there are N possible argument triples, the total number of operations
is

$$N(r_1 + r_2 + r_3) \tag{6.6-14}$$

which is never greater than $N^2$ and, when $r_1$, $r_2$, and $r_3$ are substantially
greater than 1, is much less than $N^2$. Moreover, if t > 3, the equation
corresponding to (6.6-14) is {25}

$$N(r_1 + r_2 + \cdots + r_t) \tag{6.6-15}$$

where the inequality with respect to $N^2$ is likely to be even more
pronounced. Thus, the FFT algorithm is, at least at first glance, more efficient
than direct evaluation of the DFT, and for large N the reduction in compu- 
tation may be quite dramatic. There is, of course, a substantial amount of 
bookkeeping implied by (6.6-13), which would seem to lessen the overall 
computational advantage of the FFT algorithm; we shall return to this point 
below. 
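
The three stages of (6.6-13) can be written down almost verbatim. The sketch below is illustrative and deliberately unoptimized: it stores each f_s in a dictionary keyed by its argument triple instead of overwriting in place, and the function names and test data are hypothetical. It is checked against direct evaluation of (6.6-1).

```python
import cmath
from itertools import product

def dft_direct(g):
    """Direct evaluation of (6.6-1), used here only as a reference."""
    n = len(g)
    return [sum(g[k] * cmath.exp(2j * cmath.pi * ((j * k) % n) / n) for k in range(n))
            for j in range(n)]

def fft_three_factors(g, r1, r2, r3):
    """Three-stage algorithm (6.6-13) for N = r1 * r2 * r3."""
    n = r1 * r2 * r3
    if len(g) != n:
        raise ValueError("len(g) must equal r1 * r2 * r3")

    def w(exponent):
        # w**exponent with w = exp(2*pi*i/N); the exponent is reduced mod N.
        return cmath.exp(2j * cmath.pi * (exponent % n) / n)

    triples = list(product(range(r1), range(r2), range(r3)))
    # f0(k1, k2, k3) = g_k with k = k3 + r3*k2 + r3*r2*k1
    f0 = {(k1, k2, k3): g[k3 + r3 * k2 + r3 * r2 * k1] for k1, k2, k3 in triples}
    f1 = {(j1, k2, k3): sum(f0[k1, k2, k3] * w(j1 * r3 * r2 * k1) for k1 in range(r1))
          for j1, k2, k3 in triples}
    f2 = {(j1, j2, k3): sum(f1[j1, k2, k3] * w((j1 + r1 * j2) * r3 * k2)
                            for k2 in range(r2))
          for j1, j2, k3 in triples}
    f3 = {(j1, j2, j3): sum(f2[j1, j2, k3] * w((j1 + r1 * j2 + r1 * r2 * j3) * k3)
                            for k3 in range(r3))
          for j1, j2, j3 in triples}

    # Output: G_j with j = j1 + r1*j2 + r1*r2*j3
    G = [0j] * n
    for (j1, j2, j3), value in f3.items():
        G[j1 + r1 * j2 + r1 * r2 * j3] = value
    return G

g = [complex((k * k) % 7, k % 3) for k in range(30)]   # arbitrary illustrative data
fft_error = max(abs(a - b)
                for a, b in zip(fft_three_factors(g, 2, 3, 5), dft_direct(g)))
print(fft_error)   # roundoff level
```

Each dictionary has N entries and each entry of f_s is a sum of r_s terms, which is where the count N(r_1 + r_2 + r_3) comes from.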

One aspect of (6.6-13) which tends to be confusing on initial exposure to
the FFT is that the order of the $G_j$'s computed is different from the natural
order. Consider the following case:

$$N = 12 = 2 \cdot 2 \cdot 3 \qquad j = j_1 + 2j_2 + 4j_3 \qquad k = k_3 + 3k_2 + 6k_1 \tag{6.6-16}$$

In organizing the computation of (6.6-13) we would expect to choose the
natural order of k from 0 to 11, as shown in Table 6.1. Equation (6.6-13)
makes it clear that if we begin with $g_0, \ldots, g_{11}$ in 12 successive memory
locations, then at successive steps of the algorithm we may

Overwrite $f_1(j_1, k_2, k_3)$ on $g_k = f_0(k_1, k_2, k_3)$
Overwrite $f_2(j_1, j_2, k_3)$ on $f_1(j_1, k_2, k_3)$
Overwrite $f_3(j_1, j_2, j_3) = G_j$ on $f_2(j_1, j_2, k_3)$

But Table 6.1 indicates that the order of the $G_j$ in these 12 locations is not
the natural order and that to obtain the natural order therefore requires
some unscrambling. We do note here, however, that if we reverse the order of the
digits $j_1$, $j_2$, and $j_3$ and use the natural order of the reversed digits, we obtain
the natural order of j as shown in Table 6.1. It can be proved that this digit
reversal always results in the natural order of j.
Table 6.1 Correspondence between digits of k and j for case N = 12

  k    k_1 = j_1   k_2 = j_2   k_3 = j_3    j
  0        0           0           0        0
  1        0           0           1        4
  2        0           0           2        8
  3        0           1           0        2
  4        0           1           1        6
  5        0           1           2       10
  6        1           0           0        1
  7        1           0           1        5
  8        1           0           2        9
  9        1           1           0        3
 10        1           1           1        7
 11        1           1           2       11


Implementation of the FFT algorithm is most convenient and efficient
when N is chosen so that

$$N = r^t$$

Then from (6.6-15) it follows that the number of operations required is

$$N(rt) = N r \log_r N = \frac{r}{\log_2 r}\, N \log_2 N \tag{6.6-17}$$

For fixed N the coefficient of $N \log_2 N$ in (6.6-17) is minimized when r = 3,
which implies that N should be chosen as a power of 3. But for various 
reasons, most notably that the calculation and use of the powers of w is 
simplified, it is most common to choose r = 2. Because of the importance of 
this case, we now consider it in some detail. 

When $N = 2^t$, the digits of j and k are all either 0 or 1. Indeed, $j_1, \ldots, j_t$
and $k_1, \ldots, k_t$ are, respectively, just the bits of the binary representations of j
and k. The algorithm (6.6-13), now expressed for a general t, becomes

$$
\begin{aligned}
f_0(k_1, k_2, \ldots, k_t) &= g_k \\
&\;\;\vdots \\
f_l(j_1, \ldots, j_l, k_{l+1}, \ldots, k_t) &= \sum_{k_l=0}^{1} f_{l-1}(j_1, \ldots, j_{l-1}, k_l, \ldots, k_t)\, w^{(j_1 + 2j_2 + \cdots + 2^{l-1} j_l)\, 2^{t-l} k_l} \\
&\;\;\vdots \\
G_j &= f_t(j_1, \ldots, j_t)
\end{aligned} \tag{6.6-18}
$$

The statement of this algorithm, however, can be much simplified, as follows.
Define

$$J_1 = 0 \qquad J_l = j_1 + 2j_2 + \cdots + 2^{l-2} j_{l-1} \qquad l = 2, \ldots, t \tag{6.6-19}$$

and

$$K_l = k_{l+1} + 2k_{l+2} + \cdots + 2^{t-l-1} k_t \qquad l = 0, \ldots, t - 1 \qquad K_t = 0 \tag{6.6-20}$$

Then we can write the index j in (6.6-18) as

$$j = J_l + 2^{l-1} j_l + 2^l K_l \qquad l = 1, \ldots, t \tag{6.6-21}$$

It is not hard to show {26} that with $J_l$ and $K_l$ defined as in (6.6-19) and
(6.6-20), the index j in (6.6-21) goes through its full range of values from 0 to
$2^t - 1$ for each l. The middle equation of (6.6-18) can then be written

$$f_l(J_l + 2^{l-1} j_l + 2^l K_l) = \sum_{k_l=0}^{1} f_{l-1}(J_l + 2^{l-1} k_l + 2^l K_l)\, w^{(J_l + 2^{l-1} j_l)\, 2^{t-l} k_l} \tag{6.6-22}$$

where the ranges of the indices are

$$J_l = 0, 1, \ldots, 2^{l-1} - 1 \qquad K_l = 0, 1, \ldots, 2^{t-l} - 1 \tag{6.6-23}$$

When l = t in (6.6-22), the index of $f_t$ is, using (6.6-19) and (6.6-20),

$$J_t + 2^{t-1} j_t + 2^t K_t = J_t + 2^{t-1} j_t = j_1 + 2j_2 + \cdots + 2^{t-1} j_t$$

which, using (6.6-6), is precisely j. Thus, $f_t(0), f_t(1), \ldots, f_t(2^t - 1)$ correspond to
$j = 0, 1, \ldots, 2^t - 1$, and therefore the $G_j$'s are calculated in their natural order.
But this happens because

$$K_0 = k_1 + 2k_2 + \cdots + 2^{t-1} k_t$$

has its bits reversed from those of k, as can be seen from (6.6-6). Thus, in
order for the $G_j$'s to be computed in their natural order, the $g_k$'s must be
ordered in bit-reversal sequence; i.e., the initial order of the $g_k$ must
correspond to the order resulting from taking the bits of the binary expansion of
k, namely $k_t, k_{t-1}, \ldots, k_1$, and computing $K_0$ as above. Example 6.1 will
illustrate this explicitly.

Now we split (6.6-22) into two equations, one for $j_l = 0$ and one for
$j_l = 1$, and write out both terms in the right-hand side sum to obtain

$$
\begin{aligned}
f_l(J_l + 2^l K_l) &= f_{l-1}(J_l + 2^l K_l) + f_{l-1}(J_l + 2^l K_l + 2^{l-1})\, w^{J_l 2^{t-l}} \\
f_l(J_l + 2^l K_l + 2^{l-1}) &= f_{l-1}(J_l + 2^l K_l) + f_{l-1}(J_l + 2^l K_l + 2^{l-1})\, w^{(J_l + 2^{l-1})\, 2^{t-l}}
\end{aligned} \tag{6.6-24}
$$

Finally, we note that $w^{2^{l-1} 2^{t-l}} = w^{2^{t-1}} = -1$, and we define

$$p = J_l + 2^l K_l \qquad q = p + 2^{l-1} \tag{6.6-25}$$

so that p and q both take on half the values from 0 to $2^t - 1$. Then we obtain
the complete algorithm for this case as

$$
\begin{aligned}
f_l(p) &= f_{l-1}(p) + f_{l-1}(q)\, w^{J_l 2^{t-l}} \\
f_l(q) &= f_{l-1}(p) - f_{l-1}(q)\, w^{J_l 2^{t-l}}
\end{aligned} \tag{6.6-26}
$$

$$G_p = f_t(p) \qquad G_q = f_t(q)$$


Thus, at each stage of the algorithm we proceed as follows:


1. Let $J_l$ and $K_l$ run through all possible pairs of values in (6.6-23).
2. For each pair calculate p and q as in (6.6-25).
3. Then calculate $f_l(p)$ and $f_l(q)$ from (6.6-26).


Normally in the computer implementation of the algorithm the quantities
$w^{J_l 2^{t-l}}$ which appear in (6.6-26) are precalculated and stored in a table (or a
set of coefficients from which they are easily calculated is stored in a table
{26}). Note that after the complete calculation implied by (6.6-26), we must
perform a digit reversal, as described above, to get the $G_j$ in their natural
order. In the binary case we are considering, this bit reversal consists
only of taking the bits of the binary expansion of j and computing j' as the
integer whose binary expansion is the reversal of these bits. The correct position
of $G_j$ as calculated is then $G_{j'}$. Alternatively, as in Example 6.1 below, we may
reorder the $g_k$ so that the $G_j$ are calculated in their correct order.

These remarks and the algorithm itself imply that the bookkeeping in
(6.6-26) is quite simple and straightforward. The savings achievable,
therefore, by reducing the $N^2$ operations for the DFT to $N \log_2 N$ are
considerable and for large N can be dramatic. The FFT algorithm in the form
(6.6-26) or equivalent forms {27} has become extremely popular and useful.

To conclude this section we note a couple of other features of the FFT:


1. If we are given the $G_j$ instead of the $g_k$, the algorithms (6.6-13) or (6.6-26)
are easily applied, the only difference being the change of sign of the
exponent of w from plus to minus.

2. An important case is the one where the $g_k$ are all real and we are
interested in the cosine transform

$$G_j = \sum_{k=0}^{N-1} g_k \cos \frac{2\pi jk}{N} \tag{6.6-27}$$

The statement of the FFT algorithm for this case is easily derived from
(6.6-13) or (6.6-26) {27}. Similarly we may also consider the sine transform

$$G_j = \sum_{k=1}^{N-1} g_k \sin \frac{2\pi jk}{N} \tag{6.6-28}$$


The example which follows has too small a value of N for the FFT 
algorithm to show any substantial efficiency over direct calculation of the 
DFT, but it does illustrate how the algorithm works. 


Example 6.1 Given the data

  k      0     1     2     3     4     5     6     7
  g_k    1    1+i    0    1-i    0    1+i    0    1-i

compute $G_j$, j = 0, 1, ..., 7.
Since $N = 8 = 2^3$, we use (6.6-26). First we perform a bit reversal and reorder
the $g_k$ to get $f_0$. We do this by taking the bits of k and reversing them as in the table below:

  k    k_1   k_2   k_3    4k_3 + 2k_2 + k_1 = k'
  0     0     0     0              0
  1     0     0     1              4
  2     0     1     0              2
  3     0     1     1              6
  4     1     0     0              1
  5     1     0     1              5
  6     1     1     0              3
  7     1     1     1              7

Thus the order of $f_0$ is that with k' in its natural order.
We begin by calculating the necessary powers of w. We have

$$w = e^{2\pi i/8} = \cos\frac{\pi}{4} + i \sin\frac{\pi}{4} = \frac{\sqrt{2}}{2}(1 + i) \qquad w^2 = i \qquad w^3 = \frac{\sqrt{2}}{2}(-1 + i)$$

Using these and (6.6-25) and (6.6-26), we then calculate

                0     1     2     3     4      5     6      7
  f_0           1     0     0     0    1+i    1+i   1-i    1-i
  f_1           1     1     0     0    2+2i    0    2-2i    0
  f_2           1     1     1     1     4      0     4i     0
  f_3 = G_j     5     1    -3     1    -3      1     5      1

The reader who wishes to check the values of $G_j$ using (6.6-1) will find that even in this
simple case the FFT algorithm has much to recommend it over brute force.
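
The steps of Example 6.1 can be traced in code. The sketch below is illustrative, with hypothetical function names: it places g_k at the bit-reversed position k', then applies (6.6-25) and (6.6-26) for l = 1, ..., t, overwriting the array in place. The input values are those implied by the f_0 row of the worked table together with the bit-reversal table.

```python
import cmath

def bit_reverse(k, t):
    """Reverse the t-bit binary expansion of k."""
    reversed_k = 0
    for _ in range(t):
        reversed_k = (reversed_k << 1) | (k & 1)
        k >>= 1
    return reversed_k

def fft_radix2(g):
    """Apply (6.6-26) for N = 2**t to bit-reversed input; returns G_j in natural order."""
    n = len(g)
    t = n.bit_length() - 1
    if n != 1 << t:
        raise ValueError("len(g) must be a power of 2")

    # f_0: position k' (the bit reversal of k) holds g_k.
    f = [0j] * n
    for k in range(n):
        f[bit_reverse(k, t)] = g[k]

    for l in range(1, t + 1):
        half = 1 << (l - 1)                      # 2**(l-1)
        for K in range(1 << (t - l)):            # K_l = 0, ..., 2**(t-l) - 1
            for J in range(half):                # J_l = 0, ..., 2**(l-1) - 1
                p = J + (K << l)                 # (6.6-25)
                q = p + half
                # w**(J_l * 2**(t-l)) with w = exp(2*pi*i/N)
                twiddle = cmath.exp(2j * cmath.pi * (J << (t - l)) / n)
                a, b = f[p], f[q] * twiddle
                f[p], f[q] = a + b, a - b        # (6.6-26)
    return f

g = [1, 1 + 1j, 0, 1 - 1j, 0, 1 + 1j, 0, 1 - 1j]   # g_0, ..., g_7
G = fft_radix2(g)
print([complex(round(z.real, 10), round(z.imag, 10)) for z in G])
```

Up to roundoff the result is 5, 1, -3, 1, -3, 1, 5, 1, matching the last row of the table. In a production setting the twiddle factors would be taken from a precomputed table instead of being recomputed inside the loop.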
6.6-2 Least-Squares Approximations and Trigonometric Interpolation


Throughout this section we assume that the set of points $\{x_i\}$ is equally
spaced. As in Sec. 6.4, we assume that the number of points n is odd and
equal to 2L + 1. The corresponding results when n is even are considered in
a problem {31}.

For convenience we set $x_i = 2\pi i/(2L + 1)$, i = 0, ..., 2L. The set of
functions $\{\phi_j(x)\}$ that we shall use are the 2L + 1 functions 1, cos x, ..., cos Lx,
sin x, ..., sin Lx. We limit ourselves to a number of functions equal to the
number of data points because, as before, we expect to have no more than
2L + 1 independent functions on 2L + 1 points. In fact, it is not hard to
show {29} that sin kx or cos kx, where k is an integer greater than L, can be
simply expressed in terms of one of the above functions on the set $\{x_i\}$.

Just as the Fourier functions satisfy an orthogonality relationship over
the interval $[0, 2\pi]$, so do the above functions satisfy an orthogonality
relationship over the discrete set of points $\{x_i\}$. In fact, we can show that {30}


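$$
\sum_{i=0}^{2L} \sin jx_i \sin kx_i =
\begin{cases}
0 & j \ne k \\
\dfrac{2L+1}{2} & j = k \ne 0 \\
0 & j = k = 0
\end{cases} \tag{6.6-29}
$$

$$
\sum_{i=0}^{2L} \cos jx_i \cos kx_i =
\begin{cases}
0 & j \ne k \\
\dfrac{2L+1}{2} & j = k \ne 0 \\
2L+1 & j = k = 0
\end{cases} \tag{6.6-30}
$$

$$\sum_{i=0}^{2L} \cos jx_i \sin kx_i = 0 \qquad \text{all } j, k \tag{6.6-31}$$

where j and k are restricted to run between 0 and L.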

With 2L + 1 functions we would expect to be able to fit exactly the
2L + 1 points $\{x_i\}$, but in the least-squares context, this again is just what we
do not wish to do. As before, we want to use enough functions to provide a
good approximation to the true function f(x) but not so many that we lose
smoothing. Suppose then, as before, that $f_i$ is the observed value of f(x) at $x_i$.
We approximate f(x) by

$$y_m(x) = \tfrac{1}{2} a_0 + \sum_{j=1}^{m} (a_j \cos jx + b_j \sin jx) \qquad m < L \tag{6.6-32}$$

in direct analogy with standard Fourier-series notation. Again we determine
the $a_j$'s and $b_j$'s so that the sum of the squares of the differences $f_i - y_m(x_i)$
will be minimized. When we use the orthogonality relations (6.6-29) to
(6.6-31), the normal equations (6.2-6) yield

$$a_j = \frac{2}{2L+1} \sum_{i=0}^{2L} f_i \cos jx_i = \frac{2}{2L+1} \sum_{i=0}^{2L} f_i \cos \frac{2\pi ij}{2L+1} \qquad j = 0, \ldots, m \tag{6.6-33}$$

$$b_j = \frac{2}{2L+1} \sum_{i=0}^{2L} f_i \sin jx_i = \frac{2}{2L+1} \sum_{i=0}^{2L} f_i \sin \frac{2\pi ij}{2L+1} \qquad j = 1, \ldots, m \tag{6.6-34}$$


In order to compute the coefficients in (6.6-33) and (6.6-34) we can, of 
course, evaluate the summations directly. But referring to (6.6-27) and 
(6.6-28), we see that these summations have just the form of those used in the 
FFT with g’s replaced by f’s and N replaced by 2L + 1. Although 2L is not 
usually large enough for use of the FFT to result in a great saving of time, it 
can conveniently be used to evaluate (6.6-33) and (6.6-34). Of course when 
the number of points is odd, we cannot use the power-of-2 FFT algorithm 
(6.6-26) but must use the more general case (6.6-13). Another algorithm 
which is much more efficient than direct evaluation for the summations 
(6.6-33) and (6.6-34) is considered in a problem {32}. 
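
For small L the coefficients can simply be evaluated from (6.6-33) and (6.6-34) as written. The sketch below is illustrative only: the function names are hypothetical and the samples are noise-free values of an arbitrarily chosen test function, so the recovered coefficients should reproduce that function's coefficients up to roundoff.

```python
import math

def fourier_coefficients(f_values, m):
    """Coefficients (6.6-33) and (6.6-34) for 2L+1 equally spaced data values."""
    n = len(f_values)                      # n = 2L + 1
    if n % 2 == 0 or m > (n - 1) // 2:
        raise ValueError("need an odd number of points and m <= L")
    x = [2 * math.pi * i / n for i in range(n)]
    a = [2 / n * sum(f * math.cos(j * xi) for f, xi in zip(f_values, x))
         for j in range(m + 1)]
    b = [0.0] + [2 / n * sum(f * math.sin(j * xi) for f, xi in zip(f_values, x))
                 for j in range(1, m + 1)]  # b[0] is a placeholder; b_0 is not used
    return a, b

def evaluate(a, b, x):
    """y_m(x) as in (6.6-32)."""
    return a[0] / 2 + sum(a[j] * math.cos(j * x) + b[j] * math.sin(j * x)
                          for j in range(1, len(a)))

# Illustrative noise-free samples of an assumed test function on 9 points (L = 4).
L = 4
n = 2 * L + 1
samples = [2 + math.cos(x) + 3 * math.sin(2 * x)
           for x in (2 * math.pi * i / n for i in range(n))]
a, b = fourier_coefficients(samples, 3)
print([round(v, 6) for v in a], [round(v, 6) for v in b])
```

Because of the factor ½ in front of a_0 in (6.6-32), the constant term 2 of the test function appears as a_0 = 4.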

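The results of Sec. 6.3 can be applied to approximations of the form
(6.6-32) just as they were to polynomial approximations. In particular, we
can calculate

$$
\begin{aligned}
\delta_m^2 &= \sum_{i=0}^{2L} \Bigl[ f_i - \tfrac{1}{2} a_0 - \sum_{j=1}^{m} (a_j \cos jx_i + b_j \sin jx_i) \Bigr]^2 \\
&= \sum_{i=0}^{2L} f_i^2 - \frac{2L+1}{2} \Bigl[ \tfrac{1}{2} a_0^2 + \sum_{j=1}^{m} (a_j^2 + b_j^2) \Bigr]
\end{aligned} \tag{6.6-35}
$$

by making use of the orthogonality relations (6.6-29) to (6.6-31) {30}.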


Example 6.2 Use the data in the following table: 


0 


j, | 3.0004 


5.7203 | 3.1993 — 8679 | 2.9890 | 4.0985 


to calculate Fourier least-squares approximations with m = 1, 2, 3. 
By direct evaluation of (6.6-33) and (6.6-34) or using the algorithm of {32} we get

$a_0 = 4.00022$    $a_1 = .99998$    $a_2 = .00011$    $a_3 = .00023$
$b_1 = .00029$    $b_2 = 2.99997$    $b_3 = .00002$


Using the second equality in (6.6-35), we calculate

$\delta_0^2 = 44.99932$    $\delta_1^2 = 40.49950$    $\delta_2^2 = .00031$    $\delta_3^2 = .00031$


To find the value of m which results in the best least-squares approximation, we
would generally use (6.3-10) with 2(L − m) in the denominator (why?). In this example,
however, it is clear, as indeed the coefficients $a_j$ and $b_j$ indicated, that m = 2 gives the
desired least-squares approximation. In fact, the data in the table are slightly perturbed
values of f(x) = 2 + cos x + 3 sin 2x.
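
The behavior of the residual sums in this example can be reproduced in outline. The sketch below is an illustration under stated assumptions: it uses unperturbed values of f(x) = 2 + cos x + 3 sin 2x on nine points (L = 4) in place of the tabulated data, so the numbers differ slightly from those above, and the function name is hypothetical. It evaluates the sum of squared residuals both directly and by the second expression in (6.6-35).

```python
import math

def residual_sums(f_values, m_max):
    """delta_m^2 for m = 0..m_max, computed by both expressions in (6.6-35)."""
    n = len(f_values)                                   # n = 2L + 1
    x = [2 * math.pi * i / n for i in range(n)]
    a = [2 / n * sum(f * math.cos(j * xi) for f, xi in zip(f_values, x))
         for j in range(m_max + 1)]
    b = [2 / n * sum(f * math.sin(j * xi) for f, xi in zip(f_values, x))
         for j in range(m_max + 1)]                     # b[0] is identically 0
    direct, shortcut = [], []
    for m in range(m_max + 1):
        y = [a[0] / 2 + sum(a[j] * math.cos(j * xi) + b[j] * math.sin(j * xi)
                            for j in range(1, m + 1)) for xi in x]
        direct.append(sum((f - yi) ** 2 for f, yi in zip(f_values, y)))
        shortcut.append(sum(f * f for f in f_values)
                        - n / 2 * (a[0] ** 2 / 2
                                   + sum(a[j] ** 2 + b[j] ** 2 for j in range(1, m + 1))))
    return direct, shortcut

n = 9   # L = 4; unperturbed values of the function named in Example 6.2
data = [2 + math.cos(2 * math.pi * i / n) + 3 * math.sin(4 * math.pi * i / n)
        for i in range(n)]
direct, shortcut = residual_sums(data, 3)
print([round(d, 6) for d in direct])
```

With exact data the residual drops from 45 to 40.5 when the cos x term is admitted and to zero once m = 2, which mirrors the pattern 44.99932, 40.49950, .00031 found for the perturbed data.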


In Sec. 6.2 we noted that a least-squares approximation in which the 
number of functions used equals the number of data points results in an 
exact approximation. In particular, in the polynomial case, we get the 
Lagrangian interpolation formula. When our approximating functions are 
the trigonometric functions, we get a formula for trigonometric interpola- 
tion. In common with our approach to interpolation in Chap. 3, we shall 
assume that the given function values are exact (except perhaps for some 
roundoff error). 

When m = L, Eq. (6.6-32) becomes the discrete Fourier series for the
function defined by the set of values $\{f_i\}$

$$y_L(x) = \tfrac{1}{2} a_0 + \sum_{j=1}^{L} (a_j \cos jx + b_j \sin jx) \tag{6.6-36}$$

with the $a_j$'s and $b_j$'s again given by (6.6-33) and (6.6-34). Now, however, $\delta_L^2$
as given by (6.6-35) is 0. To prove this we substitute (6.6-33) and (6.6-34) into
(6.6-36) to get

$$
\begin{aligned}
y_L(x) &= \frac{2}{2L+1} \sum_{i=0}^{2L} f_i \Bigl[ \tfrac{1}{2} + \sum_{j=1}^{L} (\cos jx_i \cos jx + \sin jx_i \sin jx) \Bigr] \\
&= \frac{2}{2L+1} \sum_{i=0}^{2L} f_i \Bigl[ \tfrac{1}{2} + \sum_{j=1}^{L} \cos j(x_i - x) \Bigr]
\end{aligned} \tag{6.6-37}
$$

We wish to show that $y_L(x_k) = f_k$. First we note that

$$\cos j(x_i - x_k) = \cos \frac{2\pi j(i - k)}{2L+1} = \cos [(2L + 1 - j)(x_i - x_k)] \tag{6.6-38}$$




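Using this, we rewrite (6.6-37) at $x = x_k$ as

$$
\begin{aligned}
y_L(x_k) &= \frac{2}{2L+1} \sum_{i=0}^{2L} f_i \Bigl[ \tfrac{1}{2} + \tfrac{1}{2} \sum_{j=1}^{L} \cos j(x_i - x_k) + \tfrac{1}{2} \sum_{j=L+1}^{2L} \cos j(x_i - x_k) \Bigr] \\
&= \frac{1}{2L+1} \sum_{i=0}^{2L} f_i \sum_{j=0}^{2L} \cos j(x_i - x_k)
\end{aligned} \tag{6.6-39}
$$

But using (6.6-29) and (6.6-30), we can show that {33}

$$
\sum_{j=0}^{2L} \cos j(x_i - x_k) =
\begin{cases}
2L + 1 & i = k \\
0 & i \ne k
\end{cases} \tag{6.6-40}
$$

from which it follows that in (6.6-39)

$$y_L(x_k) = f_k \tag{6.6-41}$$

as we wished to prove.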

Equation (6.6-36) or, equivalently, (6.6-37) is an equation for trigonometric
interpolation which agrees with the observed data at $x = x_i$, i = 0, ...,
2L, and can be used to interpolate at values of $x \ne x_i$. In the form (6.6-37)
the formula is the trigonometric analog to the Lagrangian interpolation
formula for equally spaced data in the polynomial case. The coefficient of $f_i$

$$\frac{2}{2L+1} \Bigl[ \tfrac{1}{2} + \sum_{j=1}^{L} \cos j(x_i - x) \Bigr] \tag{6.6-42}$$

is therefore the trigonometric analog of the Lagrangian interpolation
polynomial $l_i(x)$. By suitably manipulating the term in brackets in (6.6-42), we
can show that {33}

$$\tfrac{1}{2} + \sum_{j=1}^{L} \cos j(x_i - x) = \frac{\sin (L + \tfrac{1}{2})(x_i - x)}{2 \sin \tfrac{1}{2}(x_i - x)} \tag{6.6-43}$$

and thus (6.6-37) can be written

$$y_L(x) = \frac{1}{2L+1} \sum_{i=0}^{2L} f_i\, \frac{\sin (L + \tfrac{1}{2})(x_i - x)}{\sin \tfrac{1}{2}(x_i - x)} \tag{6.6-44}$$

In contrast to the polynomial case, we cannot derive any very useful closed
form for the error in the approximation (6.6-44); see, however, {34} for one
expression for the error.
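
A direct evaluation of (6.6-44) takes only a few lines. The sketch below is illustrative, with a hypothetical function name and test data. One detail is an implementation choice and not part of the text: at a data point the ratio of sines is 0/0, so the code substitutes its limiting value 2L + 1 there.

```python
import math

def trig_interpolate(f_values, x):
    """Evaluate (6.6-44) at x for data f_i at x_i = 2*pi*i/(2L+1)."""
    n = len(f_values)                      # n = 2L + 1
    if n % 2 == 0:
        raise ValueError("an odd number of points is assumed")
    L = (n - 1) // 2
    total = 0.0
    for i, f in enumerate(f_values):
        d = 2 * math.pi * i / n - x        # x_i - x
        s = math.sin(d / 2)
        if abs(s) < 1e-12:
            # Limit of the ratio as x -> x_i (mod 2*pi); it equals 2L + 1.
            ratio = float(n)
        else:
            ratio = math.sin((L + 0.5) * d) / s
        total += f * ratio
    return total / n

# Illustrative check with samples of a trigonometric polynomial of degree 2 (L = 4).
n = 9
nodes = [2 * math.pi * i / n for i in range(n)]
values = [2 + math.cos(x) + 3 * math.sin(2 * x) for x in nodes]
print(trig_interpolate(values, nodes[3]) - values[3])          # agreement at a data point
print(trig_interpolate(values, 1.0) - (2 + math.cos(1.0) + 3 * math.sin(2.0)))
```

Since the test function is itself a combination of the 2L + 1 basis functions, the interpolant reproduces it between the data points as well, up to roundoff; for a general f only agreement at the x_i is guaranteed.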


BIBLIOGRAPHIC NOTES 


There is a wide literature on methods for fitting curves to numerical data. For a general 
reference covering the subject matter of this chapter as well as the statistical background and 
many related subjects we recommend Guest (1961). 
Sections 6.1 and 6.2. Most numerical analysis texts contain some material on least-squares
approximation; in particular, see Isaacson and Keller (1966), Shampine and Allen (1973), and 
Hamming (1973). The first of these also includes a good deal of material on least-squares 
approximations over continuous intervals. Davis (1963) and Rice (1964) consider least-squares 
approximation in the context of linear spaces; the latter also considers discrete least-squares 
approximations and the problem of minimizing (6.2-4). 


Section 6.3 Much of the material from this section is from Forsythe (1957). For more 
information on the ill condition of the Hilbert matrix, see Todd (1954), and for related computa- 
tional problems, see Wilkinson (1961). Guest (1961) discusses methods for the solution of the 
normal equations; see also Chap. 9. 


Section 6.4 Guest (1961) has an excellent chapter on the generation and use of orthogonal 
polynomials in least-squares approximations. Another good source is the paper by Birge and 
Weinberg (1947); see also Forsythe (1957) and Aitken (1932). A good basic treatment will be 
found in Shampine and Allen (1973). 


Section 6.6 Three good references on the fast Fourier transform are Davis and Rabinowitz 
(1975), Cooley, Lewis, and Welch (1977), and Brigham (1974). For a much fuller discussion of
the numerical analysis of periodic functions, see Hamming (1973). A good discussion of trigon- 
ometric interpolation will be found in Lanczos (1956); see also Lanczos (1938). The standard 
text on the subject of trigonometric series in general is Zygmund (1952). 
PROBLEMS


Section 6.2 


1 Given the data 


calculate the coefficients in the normal equations for m = 3, 4, 5 with (a) $\phi_j(x) = x^j$; (b) $\phi_j(x) =
P_j(x)$, the Legendre polynomial of degree j. Use w(x) = 1.


Section 6.3 


*2 Let v be a column vector such that $v^T = (v_1, \ldots, v_n)$, where T denotes the transpose.
Define the norm of v to be $\|v\| = (v_1^2 + v_2^2 + \cdots + v_n^2)^{1/2}$.
(a) Show that the least-squares problem for polynomials with w(x) = 1 can be written in
the form

$$\|f - Qa^{(m)}\|^2 = \text{minimum}$$

where $f^T = (f_1, \ldots, f_n)$, $(a^{(m)})^T = (a_0^{(m)}, \ldots, a_m^{(m)})$, and Q is an n × (m + 1) matrix with columns
$q_j = (x_1^j, x_2^j, \ldots, x_n^j)^T$, j = 0, ..., m.
(b) If n > m and c is a column vector with m + 1 components, show that

$$c^T Q^T Q c \ge 0$$

with the equality holding only if c = 0.
(c) Show that the minimum problem of part (a) is equivalent to minimizing

$$(a^{(m)} - G^{-1}g)^T G (a^{(m)} - G^{-1}g) + f^T f - g^T (G^{-1})^T g$$

where $G = Q^T Q$ and $g = Q^T f$.
(d) Use this result and part (b) to show that

$$a^{(m)} = G^{-1} Q^T f$$

is the unique solution of the minimum problem. [Ref.: Forsythe (1957).]

3 (a) If the set of points $\{x_i\}$ is symmetrically placed with respect to 0, show that the
system of normal equations (6.3-4) can be “decoupled” into two sets of equations with (m + 1)/2
equations in each set if m is odd and m/2 and (m/2) + 1 equations in the two sets if m is even.

(b) Display these two sets of equations for the normal equations derived in Prob. la. 

(c) Can the systems derived in Prob. 1b be decoupled? What is the general rule? 

*4 A theorem of Cauchy states that if $a_1, \ldots, a_n, b_1, \ldots, b_n$ are 2n numbers, the determinant
with elements $(a_i + b_k)^{-1}$, i = 1, ..., n, k = 1, ..., n, has the value

$$\frac{\prod_{i>k} (a_i - a_k)(b_i - b_k)}{\prod_{i,k=1}^{n} (a_i + b_k)}$$


(a) Use this theorem to show that A,, the determinant of H,, is given by 


1 
a; + b, 


(b) If A‘’ is the minor of the element in the ith row and jth column of H,, show that 


ny ne ae) 
"— [- DIG — D!'Pnt(n— i)! (n — jf)! —1 


Te + k)! 


(c) Use induction to show that 


n" n-1 (n? _ k2)r- kk! _ 


n!,~-, (n+k)! 

(d) Use the results of parts (a) to (c) to show that $h_n^{ij}$, the element in the ith row and jth
column of $H_n^{-1}$, is given by

$$h_n^{ij} = \frac{(-1)^{i+j} (n + i - 1)!\,(n + j - 1)!}{(i + j - 1)\,[(i - 1)!\,(j - 1)!]^2\,(n - i)!\,(n - j)!}$$

(e) Use this result to show that

$$h_{n+1}^{ij} = \frac{(n + i)(n + j)}{(n + 1 - i)(n + 1 - j)}\, h_n^{ij} \qquad i, j = 1, \ldots, n$$

$$h_{n+1}^{n+1,j} = h_{n+1}^{j,n+1} = \frac{(-1)^{n+1+j} (2n + 1)!\,(n + j)!}{(n + j)\,[n!\,(j - 1)!]^2\,(n + 1 - j)!}$$

(f) Use the results of part (e) to calculate $H_n^{-1}$, n = 2, 3, 4, 5. [Ref.: Savage and Lukacs
(1954).]

5 Using equations derived in Prob. 1, compute the coefficients of the least-squares
approximations for m = 3, 4, 5 for the case $\phi_j(x) = x^j$. Use any technique to solve the normal
equations. Also calculate the determinant of the coefficients in the normal equations (see
Prob. 10).

6 (a) Repeat the calculations of the previous problem for the case $\phi_j(x) = P_j(x)$.

(b) Convert the Legendre polynomial approximations to the form of approximations in 
powers of x. Compare the coefficients with those found in the previous problem. How do you 
account for the differences? Which coefficients do you expect to be more accurate? Why? (See 
Prob. 10.) 


Section 6.4 


7 (a) Given m vectors $v_1, v_2, \ldots, v_m$, show how the Gram-Schmidt procedure can be used
to generate a related set of orthogonal vectors.

(b) Starting with the powers of x (1, x, $x^2$, ...), show how the Gram-Schmidt procedure
can be used to generate the Legendre polynomials (see Prob. 19, Chap. 4).

8 (a) Prove that any polynomial of degree j satisfying the orthogonality relationship
(6.4-5) with $w_i > 0$ for all i has j distinct zeros interior to the interval spanned by the points $\{x_i\}$.

(b) Show that if $p_j(x)$, j = 0, 1, ..., n, are a set of polynomials of degree j satisfying (6.4-5),
then $p_n(x_i) = 0$, i = 1, ..., n.

9 (a) By substituting for $b_j$ in (6.4-23) the value found by solving for $b_j$ in (6.4-27) show
that $q_0(x)$ as defined in (6.4-27) is equal to $y_m(x)$.

(b) Derive a similar recurrence for $y'_m(x)$.


10 (a) Use the data of Prob. | and orthogonal polynomials generated using (6.4-14) to 
generate a least-squares approximation for m = 1, 2, 3, 4, 5. Express the approximations as sums 
of powers of x. Compare these results with those of Probs. 5 and 6 and discuss the differences. 


(b) From these results which, if any, of these values of m would you choose to be M in 
(6.2-7)? 
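11 (a) Show that

$$\Delta(u_i v_i) = u_i\, \Delta v_i + v_{i+1}\, \Delta u_i$$

(b) Use this to derive the formula for summation by parts

$$\sum_{i=R}^{S} u_i\, \Delta v_i = u_i v_i \Big|_{R}^{S+1} - \sum_{i=R}^{S} v_{i+1}\, \Delta u_i$$

(c) Derive also the alternative formula

$$\sum_{i=R}^{S} u_i\, \Delta v_i = u_{i-1} v_i \Big|_{R}^{S+1} - \sum_{i=R}^{S} v_i\, \Delta u_{i-1}$$
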

12 (a) Ifthe N + 1 points x,,i = 0, 1,..., N, are equally spaced at an interval h, show that 
the orthogonality condition (6.4-5) can be recast as 


Yay 1(i) A’U,(i, N) = 0 
where A’U (i, N) = w;p,(Xo + ih) and q,_ ,(x) is a polynomial of degree j — 1 or less. 
(b) Use the results of Prob. 11 to derive the conditions 
U(k, N) =0 k=0,1,...,.j-1 
U(N +k, N)=0 =1,...,j 
(c) For w(x) = 1 show that 


A2*1U (i, N) =0 
and then conclude that
U (i, N) = Ajyi(i — N - 1) 


where Aj, is an arbitrary constant. 
13 (a) Use the series expansions for (1 + x)? and (1 + x)” to derive the identity 


2) Sd Se al 


, _(-p) 
q q! 


(b) If we define 


where p and q are positive integers, show that 


Caen 


(c) Use the results of parts (a) and (b) to derive the identity 


PO Vn 


*14 (a) Use the result of Prob. 13 to show that 
| j N—k\fi-j 
(n= y= (ny (wr )(} 
x j-k k 
(b) Derive the relation 
(4 — my” = @er® 
(c) Use the results of parts (a) and (b) and Prob. 12c to show that 


J (-1Iff(N-k)\, 
Uj(i, N) = (- It Aw YS (O jer 


(d) Then with w(x) = 1 use the definition of U; in Prob. 12a to derive 


, ; p+ k)O(N —k 
pi, N) = (— 1st Aw 2 (— IY U ‘ (" i) 


(e) Use this to derive 


j (j + a jfk) 
' — k 
a nL 


[Ref.: Hildebrand (1974).] 
15 (a) Use the results of Probs. 13 and 14 to show that p,(N, N) = 1 if By = (-1)’. 
(b) If N=2L and s is defined so that i= L+ s, show that the formula derived in 
Prob. 14e becomes 


d af UF KOM (L + 5) 
p(s, 2L)= ¥ (—1) (ky? (2b) 


k=0 
(c) Use induction to prove that if s in $p_j(s, 2L)$ is replaced by sL, then

$$\lim_{L \to \infty} p_j(sL, 2L) = P_j(s)$$

16 (a) Display the equations corresponding to (6.4-23) to (6.4-26) for the Gram 
polynomials. 
(b) If we define y(N) = ))7-o w,[p,(i, N)]’, show that 


y(N) = (— 1)! oj LUA +4, N) 


where c, is the coefficient of i’ in p,(i, N) and U, is as defined in Prob. 12. 
(c) Use this result and summation by parts to show that when w(x) = 1, 


> [p,(i, N)P? = Se a G+ I)%G+i-N—-1)?) 


| 
~ [Nop 2 x Vj + i) 2a) 


(d) Use this result to derive 


_ (2L + 14+ s)!(2L + J)! 
1+ DIAL 


(e) Use an argument based on symmetry to show that «,;, , = 0 in the recurrence relation 
for the Gram polynomials corresponding to (6.4-14). 
(f) Finally use this result to derive (6.4-29) and (6.4-30). 


Section 6.5 


17 Solve the system (6.3-4) with m = 4 for the data of Sec. 6.5 using any method and 
carrying (a) six decimal places; (b) eight decimal places. In both cases substitute the results back 
into the system of equations to see how well they satisfy the system. 

18 For the data of Sec. 6.5: 

(a) Use a Lagrangian five-point formula (see Sec. 4.1) to differentiate the data numerically 
at x = .3, .4, .5, .6, .7. In each case center the Lagrangian formula at the point at which the 
derivative is to be calculated. 

(b) Calculate the derivatives at these points using the least-squares approximation (6.5-5). 

(c) Calculate the derivatives using the true function (6.5-6). Compare these results with 
those of parts (a) and (b) and discuss the reasons for the errors. 


19 Given the data 


(a) Find the best least-squares orthogonal polynomial approximation to these data. 
(b) Calculate the residuals at the data points using the approximations of part (a). 
20 (a) Use (6.4-28) to (6.4-30) to derive the first six Gram polynomials for the case L = 4. 


(b) Use these and the data in Sec. 6.5 to derive a least-squares approximation to these 
data corresponding to that in Sec. 6.5. 


(c) Compare your results with those in Sec. 6.5 and discuss any significant differences. 




Section 6.6 


21 (a) In (6.6-2) let g(t) dt = ds(t) and so derive the Fourier-Stieltjes form of the Fourier 
transform. 

(b) Let s(t) be a function which is 0 except at the points 0, 1/N,...,(N — 1)/N such that 
s(k/N) = g,. With x = j show that the Fourier-Stieltjes transform of part (a) reduces to (6.6-1). 


22 (a) Using the definition of w in Sec. 6.6-1, verify the orthogonality relationship (6.6-4). 
(b) Use this orthogonality to verify Eq. (6.6-3) by substituting (6.6-1) into it. 

(c) Prove the correctness of the linearity, shifting, and convolution formulas for the DFT. 
23 (a) If $\delta(j) = 1$ when $j \equiv 0 \pmod N$ and 0 otherwise, show that

$$N\,\delta(j) \leftrightarrow 1 \qquad 1 \leftrightarrow \delta(k)$$

(b) Deduce from this that if each $G_j$ is replaced by $G_j - c$ for some constant c, then all the
$g_k$'s are unchanged except $g_0$, which becomes $g_0 - c$.

(c) If the $G_j$'s are observations of a random variable, express the mean of these observations
in terms of the $g_k$'s.


24 (a) If $G_j \leftrightarrow g_k$, show that $G_{-j} \leftrightarrow g_{-k}$ and $\tilde G_j \leftrightarrow \tilde g_{-k}$, where the tilde (~) represents the
complex conjugate.
(b) If G; +g, and H,<+h,, show that 


N-1 1 N=! 


G,H,_; = > G;,,H,g,h_, 
0 =0 


1 
N N, 


r= 


(c) Use these two results to prove Parseval's theorem, namely if $G_j \leftrightarrow g_k$, then

$$\frac{1}{N} \sum_{j=0}^{N-1} |G_j|^2 = \sum_{k=0}^{N-1} |g_k|^2$$


25 (a) Display the equations corresponding to (6.6-8) to (6.6-13) for an arbitrary
value of t.
(b) Show that (6.6-15) is the equation corresponding to (6.6-14) for arbitrary t.
26 (a) Show that the number of operations in the $N = r^t$ case is minimized for fixed N if
r = 3.
(b) Show that j in (6.6-21) takes on all values from 0 to 2‘ — 1 for each 1. 
(c) Suppose a table of $\sin (2\pi s/N)$, s = 0, 1, ..., N/4, is stored in a computer. Show how $w^k$,
k = 1, ..., N, can be computed using only sums and differences of the stored quantities.
27 (a) Writing jk in (6.6-8) as k(j, + rij. + rir2j3), develop the analog of (6.6-13) which is 
the so-called Sande-Tukey form of the FFT. 
(b) Similarly develop the Sande-Tukey form of the binary algorithm by displaying the 
equations corresponding to (6.6-19) to (6.6-26). 
(c) For both the t = 3 case and the binary case display the FFT algorithm for the sine and 
cosine transform cases, (6.6-27) and (6.6-28). 
28 (a) Let $G_j = 1$, j = 0, 1, 2, 3. Calculate directly

$$H_k = \sum_{j=0}^{3-k} G_j G_{j+k} \qquad k = 0, 1, 2, 3$$

(b) Define $G_j = 0$, j = 4, 5, 6, 7, and compute $g_k$ using the FFT algorithm.
(c) Compute $|g_k|^2$ and then use the FFT algorithm to compute

$$H_j = 8 \sum_{k=0}^{7} |g_k|^2 w^{jk}$$

(d) Compare these results with those of part (a) and use the results of Prob. 24b to explain
the comparison.
[Ref.: Cooley, Lewis, and Welch (1977).]


29 (a) With $x_i = 2\pi i/(2L + 1)$ show that

$$\cos kx_i = \cos (2L + 1 - k)x_i \quad\text{and}\quad \sin kx_i = -\sin (2L + 1 - k)x_i$$

(b) Thus deduce that for any k > L, $\cos kx_i$ and $\sin kx_i$, i = 0, ..., 2L, can be expressed in
terms of, respectively, $\cos jx_i$ and $\sin jx_i$ for $j \le L$.


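30 (a) Derive the result

$$\sum_{k=0}^{2L} e^{ik\alpha} = \begin{cases} e^{iL\alpha}\,\dfrac{\sin (L + \frac12)\alpha}{\sin \alpha/2} & \alpha \ne 2\pi\nu \\[1ex] 2L + 1 & \alpha = 2\pi\nu \end{cases}$$

where $\nu$ is any integer.
(b) Deduce from this that

$$\sum_{k=0}^{2L} \cos k\alpha = \begin{cases} \dfrac{\cos L\alpha\,\sin (L + \frac12)\alpha}{\sin \alpha/2} & \alpha \ne 2\pi\nu \\[1ex] 2L + 1 & \alpha = 2\pi\nu \end{cases}$$

$$\sum_{k=0}^{2L} \sin k\alpha = \frac{\sin L\alpha\,\sin (L + \frac12)\alpha}{\sin \alpha/2} \qquad \alpha \ne 2\pi\nu$$

(c) Then letting $\alpha = 2\pi j/(2L + 1)$, deduce that

$$\sum_{k=0}^{2L} \cos jx_k = \begin{cases} 0 & j \ne \nu(2L + 1) \\ 2L + 1 & j = \nu(2L + 1) \end{cases} \quad\text{and}\quad \sum_{k=0}^{2L} \sin jx_k = 0$$

where $x_k = 2\pi k/(2L + 1)$.
(d) Use these results and the identities for the products of two sines, two cosines, and one
sine and one cosine to derive (6.6-29) to (6.6-31) and (6.6-35).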


31 (a) Derive the relations corresponding to (6.6-29) to (6.6-31) when the number of 
points n is even. 

(b) Derive the relations corresponding to (6.6-33), (6.6-34), and (6.6-35) when n is even. 

(c) When n is even consider replacing the L in (6.6-36) by L - 1 and adding a term
$\frac12 a_L \cos Lx$. Find an expression for $a_L$ analogous to (6.6-33). Why isn't it reasonable to add a
term in sin Lx?


*32 (a) With the definition

$$V_k(x) = \sum_{i=k}^{2L} f_i \sin (i - k + 1)x \qquad k = 1, \ldots, 2L$$

$$V_{2L+1}(x) = V_{2L+2}(x) = 0$$

derive the recurrence relation

$$f_k \sin x + 2 \cos x\, V_{k+1}(x) - V_{k+2}(x) = V_k(x)$$

(b) Defining $U_{kj} \sin x_j = V_k(x_j)$, find the analogous recurrence for $U_{kj}$.
(c) Then show that $a_j$ and $b_j$ in (6.6-33) and (6.6-34) are given by

$$a_j = \frac{2}{2L + 1}\,(f_0 + U_{1j} \cos x_j - U_{2j}) \qquad b_j = \frac{2}{2L + 1}\,U_{1j} \sin x_j$$


[Ref.: Goertzel (1958).] 


*33 (a) Verify (6.6-40) and from this deduce (6.6-41). (b) Derive (6.6-43) and thus deduce 
(6.6-44). (c) Derive the analog of (6.6-44) when the number of points is even. [Ref.: Hamming 
(1973), pp. 516-518.] 


*34 (a) Show that the error $\epsilon_L = f(x) - y_L(x)$ in the approximation (6.6-44) can be
expressed in the form

$$\epsilon_L = \frac{1}{2L + 1} \sum_{i=0}^{2L} \frac{\sin (L + \frac12)(x_i - x)}{\sin \frac12 (x_i - x)}\,[f(x) - f_i]$$


(b) Deduce from this that, particularly if f(x) does not change rapidly, the major contribu- 
tion to the error comes from the term or terms with x; nearest to x. [Ref.: Hamming (1973), 
pp. 516-518.] 


35 (a) Suppose f(x) is a periodic function of period 27. Show that the coefficients of the 
discrete Fourier expansion, (6.6-33) and (6.6-34), are what would be obtained if the trapezoidal 
rule were used to approximate the coefficients of the continuous Fourier series for f(x) on 
[0, 27]. 

(b) Show that for any periodic function the trapezoidal-rule correction terms in the Euler- 
Maclaurin sum formula drop out if the interval is a multiple of the period. Does this indicate 
that the trapezoidal rule is exact for periodic functions? Why? [Ref.: Hildebrand (1974), 
p. 454.] 


36 Given the data 


(a) Find the best least-squares Fourier approximation to these data. (b) Calculate the 
residuals at the data points. (c) With m = 3 calculate 52 using both forms of (6.6-35), and discuss 
the difference. 


37 Let f(x) be a function whose discrete Fourier series $y_L(x)$ corresponding to a set of
values $\{f_i\}$, i = 0, ..., 2L, is given by (6.6-36). Let the continuous Fourier-series expansion of the
observed function f(x) on $[0, 2\pi]$ be given by

$$f(x) = \tfrac12 A_0 + \sum_{j=1}^{\infty} (A_j \cos jx + B_j \sin jx)$$

(a) By summing both sides of the above equation for $x = x_i$, i = 0, ..., 2L, and using
(6.6-33), prove that

$$a_0 = A_0 + 2 \sum_{j=1}^{\infty} A_{(2L+1)j}$$

(b) Now by multiplying the continuous Fourier series by, respectively, cos kx and sin kx
and summing as above prove that

$$a_k = A_k + \sum_{j=1}^{\infty} (A_{(2L+1)j-k} + A_{(2L+1)j+k}) \qquad k > 0$$

$$b_k = B_k + \sum_{j=1}^{\infty} (B_{(2L+1)j+k} - B_{(2L+1)j-k}) \qquad k > 0$$


The importance of this result is that when the function f(x) is sampled at the equally spaced 
points {x,}, the calculated discrete Fourier series includes in its coefficients the effects of higher 
frequencies than the sampling rate. This “folding” back of the higher frequencies on the lower 
ones is often called aliasing. [Ref.: Hamming (1973), pp. 505-508.] 


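*38 A function f(x) is called band-limited if its Fourier transform

$$F(\lambda) = \int_{-\infty}^{\infty} f(t)\,e^{-2\pi i \lambda t}\,dt$$

vanishes outside the open interval $(-\Omega, \Omega)$.
(a) Show that, if f(x) is band-limited, then

$$F(\lambda) = F_p(\lambda) P(\lambda)$$

where $F_p(\lambda)$ is periodic with period $2\Omega$ and

$$P(\lambda) = \begin{cases} 1 & |\lambda| < \Omega \\ 0 & |\lambda| \ge \Omega \end{cases}$$

(b) Show that

$$F_p(\lambda) = \sum_{k=-\infty}^{\infty} c_k e^{-(\pi i/\Omega)k\lambda} \qquad\text{where}\quad c_k = \frac{1}{2\Omega}\,f\!\left(\frac{k}{2\Omega}\right)$$

and thus deduce that

$$F(\lambda) = \frac{P(\lambda)}{2\Omega} \sum_{k=-\infty}^{\infty} f\!\left(\frac{k}{2\Omega}\right) e^{-(\pi i/\Omega)k\lambda}$$

(c) Show that $P(\lambda)$ is the Fourier transform of

$$p(x) = \frac{\sin 2\pi\Omega x}{\pi x}$$

(d) Use part (c) to take the inverse transform of $F(\lambda)$ in part (b) and thus show that

$$f(x) = \sum_{k=-\infty}^{\infty} f\!\left(\frac{k}{2\Omega}\right) \frac{\sin \pi(2\Omega x - k)}{\pi(2\Omega x - k)}$$

This result is known as the sampling theorem and is due to Shannon (1949). It implies that if f(x)
is sampled at equal intervals with a frequency greater than the bandwidth $\Omega$ of the function, the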
entire function can be reconstructed from this infinite set of samples. When we approximate the 
infinite sum above by a finite sum, we have the analog for nonperiodic functions to the approxi- 
mation (6.6-44) for periodic functions. [See Hamming (1973), pp. 557-559, for a good discussion 
of the sampling theorem and its importance.] 


CHAPTER 


SEVEN 


FUNCTIONAL APPROXIMATION: 
MINIMUM MAXIMUM ERROR 
TECHNIQUES 


7.1 GENERAL REMARKS 


One way of evaluating a mathematical function—trigonometric, logarith- 
mic, exponential, Bessel, etc.—on a digital computer would be to store a 
table of the function in the memory of the computer and to use an interpola- 
tion formula to evaluate the function at nontabulated points. Not only is this 
technique extremely wasteful of the memory of the computer, but also it 
generally has no advantages in speed or accuracy over the techniques to be 
discussed in this chapter. These techniques all involve approximating a function
f(x) by a rational function.† We noted in Chap. 2 that a rational function
is the most general function of a variable x that can be evaluated
directly on a digital computer. But why use rational functions rather than 
the more familiar polynomial approximations (which are, of course, just a 
special case of rational functions)? To answer this we need to consider our 
aims in approximating functions on a computer. 


† We can, however, use a different rational function on different intervals.

The general situation is this: a computation is to be performed in which
a certain mathematical function is to be evaluated many (perhaps millions of) 
times. It is known a priori that the arguments of this function will be in some 
interval (perhaps infinite), but it is not known a priori what the arguments 
will be. Thus the function must be approximated over the entire interval. 
The property of this approximation of most importance to us is the error 
between it and the true function, an error which will vary over the interval. 

Generally, in the overall computation, it is desirable to be able to bound 
the error in the result. To bound this error without a priori knowledge of 
what the numbers involved will be, we must consider the worst possible case. 
Thus, in an approximation to a function to be used in such a computation, 
the property of the error that is of most importance is the maximum relative 
or absolute error (in magnitude) on the interval. Therefore, the major aim of 
a computer approximation to a function is to make the maximum error as 
small as possible, or, to use terminology introduced previously, we are interested
in minimizing the $L_\infty$ or uniform norm, which is sometimes also
called the Chebyshev norm. In the last section of this chapter we shall
develop techniques for generating a rational approximation to a function 
which, among all rational approximations with the same degree polynomial 
in numerator and denominator, has the minimum maximum error. We shall 
call such an approximation the Chebyshev or, more often, the minimax 
approximation. Before this, however, we shall develop techniques for 
generating good, if not minimax, approximations. 

If an approximation is to be evaluated millions of times, another aim of 
a computer approximation must certainly be to achieve maximum speed. 
We shall estimate the speed with which an approximation can be evaluated 
by considering the number of multiplications and divisions required. We 
shall assume for convenience that multiplication and division are equally 
time-consuming although on some computers division is more time-con- 
suming than multiplication. We should also note that floating-point 
addition and subtraction are sometimes as (or almost as) time-consuming as 
multiplication and division. Nevertheless, our conclusions based only on a 
consideration of multiplications and divisions will be generally valid. 

Our reason for preferring rational approximations to polynomial 
approximations is quite simple. For a given amount of computation, ra- 
tional approximations lead to smaller maximum errors than polynomial 
approximations for the functions most commonly approximated on a digital 
computer. This assertion will be illustrated empirically by examples later in 
this chapter. It implies the need to compare the computation required to 
evaluate one rational function with that required to evaluate another ra- 
tional function or a polynomial. Since the means by which rational functions 
are evaluated is of both interest and significance, we shall, in the next section, 
consider the computational aspects of the evaluation of rational 
approximations.


7.2 RATIONAL FUNCTIONS, POLYNOMIALS,
AND CONTINUED FRACTIONS


Let

$$R_{m,k}(x) = \frac{P_m(x)}{Q_k(x)} \tag{7.2-1}$$

be a rational approximation to a function f(x), where $P_m(x)$ and $Q_k(x)$ are
polynomials of degree at most m and k, respectively. We shall call
N = m + k the index of $R_{m,k}(x)$. The number of coefficients at our disposal in
$R_{m,k}(x)$ is N + 1 since one of the N + 2 coefficients in the numerator and
denominator may be chosen arbitrarily. In general it is true that the greater
the index, the higher the accuracy of the approximation. Moreover, for a
particular function over intervals of interest to us, all approximations with
the same index, in contrast to approximation with other indices, require
similar amounts of computation and achieve similar accuracy. In this section
we shall consider the computational aspect and not the accuracy of
rational approximations.

Let us consider first the evaluation of the polynomial

$$p_n(x) = x^n + a_{n-1}x^{n-1} + \cdots + a_1 x + a_0 \tag{7.2-2}$$


To evaluate $p_n(x)$ by computing all the powers of x and then multiplying by
the coefficients and adding would require 2n - 2 multiplications. A much
better technique is to write

$$p_n(x) = x(x(\cdots(x(x + a_{n-1}) + a_{n-2}) + \cdots + a_2) + a_1) + a_0 \tag{7.2-3}$$

and then use Horner's rule

$$s_k = a_{k+1} + x s_{k+1} \qquad k = n-2, \ldots, 0, -1 \qquad s_{n-1} = 1 \tag{7.2-4}$$

from which it is easily verified that $s_{-1} = p_n(x)$. The number of multiplications
required using (7.2-4) is n - 1, and the number of additions is n. When
the polynomial is not going to be evaluated many times, the use of 
Eq. (7.2-4) is the recommended way to evaluate polynomials. But if p,(x) is 
part of a rational approximation to a function to be used in a computer 
subroutine or, in any case, is going to be evaluated very many times, more 
efficient techniques are worth searching for. We shall therefore develop an 
algorithm which generally results in a better computational procedure than 
the use of (7.2-4). 
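
As a concrete illustration of (7.2-4), the following Python sketch evaluates a monic polynomial by Horner's rule. The function name `horner_monic` and the convention that the list `a` holds $a_0, \ldots, a_{n-1}$ are assumptions made here for illustration; they are not part of the text.

```python
def horner_monic(a, x):
    """Evaluate p_n(x) = x^n + a[n-1]*x^(n-1) + ... + a[1]*x + a[0].

    `a` holds a_0, ..., a_{n-1}; the leading coefficient is taken to be 1,
    as in (7.2-2).
    """
    s = 1.0  # s_{n-1} = 1
    for k in range(len(a) - 2, -2, -1):  # k = n-2, ..., 0, -1
        s = a[k + 1] + x * s             # s_k = a_{k+1} + x * s_{k+1}
    return s                             # s_{-1} = p_n(x)


# p(x) = x^5 + 2x^4 - 9x^3 - 12x^2 + 38x - 20, coefficients listed from a_0 up.
coeffs = [-20, 38, -12, -9, 2]
print(horner_monic(coeffs, 3.0))  # 148.0
```

The first pass through the loop multiplies x by 1, which is why the text counts n - 1 rather than n multiplications.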

Our method will involve making a change of variable, whose purpose
will become clear below, $y = x + \delta$, which converts the polynomial $p_n(x)$ to

$$q_n(y) = y^n + b_{n-1}y^{n-1} + b_{n-2}y^{n-2} + \cdots + b_1 y + b_0 \tag{7.2-5}$$




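If the polynomial $q_n(y)$ is divided by $y^2 - \alpha_1$, the result is

$$q_n(y) = (y^2 - \alpha_1)(y^{n-2} + c_{n-3}y^{n-3} + c_{n-4}y^{n-4} + \cdots + c_1 y + c_0) + \gamma_1 y + \beta_1 \tag{7.2-6}$$

where

$$c_j = b_{j+2} + \alpha_1 c_{j+2} \qquad j = n-3, n-4, \ldots, 0, -1, -2 \qquad c_{n-2} = 1 \quad c_{n-1} = 0 \tag{7.2-7}$$

with $c_{-1} = \gamma_1$, $c_{-2} = \beta_1$. Our interest here is in $\gamma_1$. Using (7.2-7) we write {1}

$$\begin{aligned} \gamma_1 = c_{-1} &= b_1 + \alpha_1 c_1 = b_1 + \alpha_1 (b_3 + \alpha_1 c_3) \\ &= b_1 + \alpha_1 b_3 + \alpha_1^2 (b_5 + \alpha_1 c_5) \\ &= b_1 + \alpha_1 b_3 + \alpha_1^2 b_5 + \cdots + \alpha_1^{r-1} b_{2r-1} + \alpha_1^r b_{2r+1} \end{aligned} \tag{7.2-8}$$

where $b_n = 1$ and 2r = n - 2 if n is even and n - 1 if n is odd. Setting $\gamma_1 = 0$
in (7.2-8) gives us a polynomial equation for $\alpha_1$. Suppose this equation has a
real root. Let $\alpha_1$ in (7.2-6) be this real root. Then

$$q_n(y) = (y^2 - \alpha_1)(y^{n-2} + c_{n-3}y^{n-3} + \cdots + c_1 y + c_0) + \beta_1 \tag{7.2-9}$$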


Before proceeding further we return to (7.2-8) and write the auxiliary
polynomial for $\alpha_1$ in the general form

$$u_r(y) = b_{2r+1}y^r + b_{2r-1}y^{r-1} + \cdots + b_3 y + b_1 \tag{7.2-10}$$

If $\alpha_1$ is a root of this polynomial, we can write

$$u_r(y) = (y - \alpha_1)(c'_{2r-1}y^{r-1} + c'_{2r-3}y^{r-2} + c'_{2r-5}y^{r-3} + \cdots + c'_3 y + c'_1) \tag{7.2-11}$$

where, using (7.2-10) and (7.2-11),

$$c'_{2j-1} = b_{2j+1} + \alpha_1 c'_{2j+1} \qquad j = r, \ldots, 1 \qquad c'_{2r+1} = 0 \tag{7.2-12}$$

Comparing this equation with (7.2-7), we see that $c'_{2j-1} = c_{2j-1}$ for all j.
Now consider the polynomial of degree n - 2 in parentheses in (7.2-9).
Dividing this polynomial by $y^2 - \alpha_2$ and proceeding as above, we get

$$q_n(y) = (y^2 - \alpha_1)[(y^2 - \alpha_2)(y^{n-4} + d_{n-5}y^{n-5} + d_{n-6}y^{n-6} + \cdots + d_0) + \beta_2] + \beta_1 \tag{7.2-13}$$

where the polynomial corresponding to that in (7.2-8) is

$$\gamma_2 = c_1 + \alpha_2 c_3 + \cdots + \alpha_2^{r-2} c_{2r-3} + \alpha_2^{r-1} c_{2r-1} \tag{7.2-14}$$

so that $\alpha_2$ must be a root of the polynomial in parentheses in (7.2-11), which
is to say that it, like $\alpha_1$, must be a root of $u_r(y)$. Continuing in this way, if the
polynomial corresponding to that in (7.2-8) has a real root at every stage, we
get finally

$$q_n(y) = (\cdots\{[\phi\,(y^2 - \alpha_r) + \beta_r](y^2 - \alpha_{r-1}) + \beta_{r-1}\}\cdots)(y^2 - \alpha_1) + \beta_1 \tag{7.2-15}$$

where

$$\phi = \begin{cases} y(y + \beta_{r+1}) + \alpha_{r+1} & n \text{ even} \\ y + \alpha_{r+1} & n \text{ odd} \end{cases}$$

results from the division of a quartic or cubic by $y^2 - \alpha_r$. The $\alpha_j$, j = 1, ..., r,
are the roots of $u_r(y)$, which we have thus far assumed to be all real. Note
also our implicit assumption that $b_{2r+1}$ in (7.2-10) is nonzero [it cannot be
zero if n is odd (why?)] for if not, $u_r(y)$ would not be a polynomial of degree r
and therefore could not have r roots. If $b_{2r+1} = 0$, the process leading to
(7.2-15) will fail at some point; i.e., no $\alpha_j$ will exist such that $\gamma_j = 0$.

It is an interesting fact, which we leave to a problem {2}, that the roots of
$u_r(y)$ will all be real if at least n - 1 of the roots of $q_n(y)$ have real parts which
are all nonnegative or all nonpositive. We have then the following procedure
for determining the change of variable constant $\delta$ introduced above in order
to assure that $q_n(y)$ satisfies the condition stated above. First determine the
zeros of $p_n(x)$. Three cases need to be distinguished:


1. The zero with largest or smallest real part is $a \pm bi$, with $b \ne 0$. Then let
$\delta = -a$. For if so, all zeros of $q_n(y)$ have nonnegative or all have nonpositive
real parts. In particular $q_n(y)$ will have two zeros bi and -bi, so
that with $\alpha_1 = -b^2$, $y^2 - \alpha_1$ is a factor of $q_n(y)$ and $\beta_1 = 0$ in (7.2-6).

2. The zero with largest or smallest real part is real, but the next pair are
nonreal $a \pm bi$. Then we proceed just as in case 1 since only n - 1 of the
zeros are required to all have nonpositive or all have nonnegative real
parts.

3. The two zeros with largest or smallest real parts are real. We express them
as a + b and a - b and let $\delta = -a$. Then $q_n(y)$ has zeros +b and -b. With
$\alpha_1 = b^2$ again we have $\beta_1 = 0$.


Note that almost always there are two choices of $\delta$, one for the roots
with positive real parts and one for the roots with negative real parts. In all
three cases, therefore, we not only achieve a polynomial $u_r(y)$ with all real
zeros but have the additional bonus that $\beta_1 = 0$. In case 2 above it could
turn out that, with the choice of $\delta$ above and n even, $b_{2r+1} = b_{n-1} = 0$. But
even this case can be finessed; we leave the details to a problem {3}.

Finally we note that the order of the $\alpha_j$, j = 2, ..., r, is immaterial, so
that any of the (r - 1)! orderings can be chosen. This is important because
the computational properties, i.e., the numerical accuracy, of (7.2-15) for the
spectrum of necessary values of y may depend on the order in which the $\alpha_j$,
j = 2, ..., r, are chosen.

To summarize, our algorithm to evaluate $p_n(x)$ is as follows:

Input
    x, δ, n (> 3)
    α_j    j = 1, ..., r + 1
    β_j    j = 2, ..., r + 1 (n even) or j = 2, ..., r (n odd)    (β_1 = 0)

Algorithm
    y ← x + δ
    z ← y²
    if n even then w ← y(y + β_{r+1}) + α_{r+1}
              else w ← y + α_{r+1}
    endif
    for i = r, ..., 2 do w ← (z − α_i)w + β_i
    w ← (z − α_1)w

Output
    w (= p_n(x))
The total number of operations in this algorithm is easily seen to be:

    Multiplications:    r + 2 = n/2 + 1        n even
                        r + 1 = (n + 1)/2      n odd
    Additions:          2r + 2 = n             n even
                        2r + 1 = n             n odd

This can be summarized for any n as [n/2] + 1 multiplications and n additions.
Since it can be proved that any algorithm must require at least [n/2]
multiplications and at least n additions, the algorithm above is very nearly
optimal. Indeed, for n < 7 it is known that no algorithm can achieve both
the [n/2] multiplication and n addition lower bounds.

The advantage of the quadratic-factor algorithm over Horner's rule as
regards multiplication is illustrated in the following table.

              Quadratic-factor    Horner's
    Degree    algorithm           rule
       3             2               2
       4             3               3
       5             3               4
       6             4               5
       7             4               6
       8             5               7
       9             5               8
      10             6               9


The analytic effort required to express $p_n(x)$ in the form (7.2-15) becomes
considerable as n increases. But it is clearly worthwhile when the polynomial 
in question is part of an approximation to a function which, once found, can 
be used forever. 

This quadratic-factor algorithm by no means gives the best possible 
result. For cases when n is even, it is often (always when n = 4) possible by a 
simple trick to reduce the number of multiplications by 1 {4}. More gen- 
erally, it is known, for example, that it is possible to evaluate any polynomial 
of sixth degree with three multiplications {6}. However, the quadratic-factor 
algorithm is nearly minimal for modest values of n and, for all but very small 
values of n, a distinct improvement over Horner’s rule. 


Example 7.1 Apply the above technique to the polynomial

$$p_5(x) = x^5 + 2x^4 - 9x^3 - 12x^2 + 38x - 20$$

The zeros of $p_5(x)$ are +1, +1, +2, -3 + i, -3 - i. Therefore, we choose $\delta = +3$ and
make the change of variable y = x + 3 to obtain

$$q_5(y) = y^5 - 13y^4 + 57y^3 - 93y^2 + 56y - 80$$

and, since r = (n - 1)/2 = 2, we have

$$u_2(y) = 56 + 57y + y^2 = (y + 56)(y + 1)$$

from which we obtain

$$q_5(y) = [(y - 13)(y^2 + 56) + 648](y^2 + 1)$$

with $\beta_1$ in (7.2-15) equal to 0. Note that the relatively large value of $\beta_2 = 648$ should give
rise to some disquiet concerning the roundoff error which may be incurred in evaluating
$q_5(y)$.
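
The summarized algorithm can be sketched in Python and checked against Example 7.1. This is only an illustration: the function name `quadratic_factor_eval` and the use of dictionaries keyed by the subscripts j are choices made here, and the values $\alpha_1 = -1$, $\alpha_2 = -56$, $\alpha_3 = -13$, $\beta_2 = 648$ are read off the factored form of $q_5(y)$ in the example.

```python
def quadratic_factor_eval(x, delta, n, alpha, beta):
    """Evaluate a monic degree-n polynomial from its form (7.2-15).

    alpha and beta are dicts keyed by the subscript j of the text
    (alpha: 1..r+1; beta: 2..r+1 for n even, 2..r for n odd; beta_1 = 0).
    """
    y = x + delta
    z = y * y
    r = (n - 2) // 2 if n % 2 == 0 else (n - 1) // 2
    if n % 2 == 0:
        w = y * (y + beta[r + 1]) + alpha[r + 1]
    else:
        w = y + alpha[r + 1]
    for i in range(r, 1, -1):          # i = r, ..., 2
        w = (z - alpha[i]) * w + beta[i]
    return (z - alpha[1]) * w          # beta_1 = 0


def p5(x):
    """The polynomial of Example 7.1, written out term by term."""
    return x**5 + 2 * x**4 - 9 * x**3 - 12 * x**2 + 38 * x - 20


ALPHA = {1: -1, 2: -56, 3: -13}
BETA = {2: 648}

for x in (-2, 0, 1, 3):
    print(x, quadratic_factor_eval(x, 3, 5, ALPHA, BETA), p5(x))
```

For n = 5 the routine uses three multiplications (one for z and two in the products with w), in line with the count r + 1 given above.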


In order to evaluate $R_{m,k}(x)$, then, one approach would be to evaluate
$P_m(x)$ and $Q_k(x)$ as described above and then to take their quotient. Another
approach is to write $R_{m,k}(x)$ in the form of a continued fraction. To convert a
rational function to a continued fraction, we perform a series of divisions
and reciprocations. For example,

$$\begin{aligned} \frac{2x^3 + x^2 + x + 3}{x^2 - x + 4} &= 2x + 3 - \frac{4x + 9}{x^2 - x + 4} \\ &= 2x + 3 - \cfrac{4}{\cfrac{x^2 - x + 4}{x + \frac{9}{4}}} \\ &= 2x + 3 - \cfrac{4}{x - \frac{13}{4} + \cfrac{\frac{181}{16}}{x + \frac{9}{4}}} \end{aligned}$$


An algorithm for performing this conversion in general is considered in {8}. 
Here we consider explicitly only the cases m = k or k + 1 and leave the other 
cases to a problem {9}. For these cases, the continued-fraction form of $R_{m,k}(x)$
is (except in certain degenerate cases {8})

$$\begin{aligned} R_{m,k}(x) &= C_0 x + D_0 + \cfrac{C_1}{x + D_1 + \cfrac{C_2}{x + D_2 + \cfrac{C_3}{x + D_3 + \cdots + \cfrac{C_k}{x + D_k}}}} \\ &= C_0 x + D_0 + \frac{C_1}{x + D_1 +}\;\frac{C_2}{x + D_2 +}\;\cdots\;\frac{C_k}{x + D_k} \end{aligned} \tag{7.2-16}$$

where $C_0 = 0$ when m = k. To evaluate the continued fraction (7.2-16) we
calculate

$$d_j = \frac{C_j}{x + D_j + d_{j+1}} \qquad j = k, \ldots, 1 \qquad d_{k+1} = 0 \tag{7.2-17}$$

from which it follows that $R_{m,k}(x) = C_0 x + D_0 + d_1$. The computation of
(7.2-16) requires k divisions and, if m = k + 1, one multiplication. In the
following table we compare the number of multiplications and divisions
required to evaluate $R_{m,k}(x)$ first by evaluating the numerator and denominator
polynomials using the quadratic-factor algorithm discussed above and
second by using continued fractions.†


† Note that in evaluating the two polynomials and then dividing, we can assume that one
polynomial is in the form (7.2-2), but the other will in general have a coefficient multiplying the
highest power of x which adds one multiplication to the total. Note also that the calculation of
$z = y^2$ need be performed only once if the same $\delta$ is used in the numerator and denominator.

[Table: multiplications and divisions required to evaluate $R_{m,k}(x)$; columns: Polynomial evaluations, Continued fraction]


Note that when m or k = 4, the number of multiplications in the second
column can be reduced by 1 since any monic quartic can be evaluated with
two multiplications {4}.

Thus, we see that if division and multiplication are equally time-consuming,
the continued-fraction approach is superior using this comparison,
but if multiplication is somewhat faster than division, this may not be
true. The number of additions and subtractions required by the two
techniques is the same.

We considered the continued-fraction formulation with m = k or k + 1 
because these generally give the most accurate approximations among all 
rational approximations with the same index. In conclusion it is only fair to 
point out that the continued-fraction approach, like the quadratic-factor 
algorithm, may lead to certain computational difficulties, the main one being 
loss of significance due to the subtraction of nearly equal quantities. This 
difficulty can often be overcome quite easily {12, 13}, however. 
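
The recurrence (7.2-17) translates directly into a short loop. The Python sketch below is illustrative only: the name `continued_fraction_eval` and the list layout of the $C_j$ and $D_j$ are assumptions, and the coefficients are those of the worked conversion of $(2x^3 + x^2 + x + 3)/(x^2 - x + 4)$ earlier in this section (m = 3, k = 2). Exact rational arithmetic is used so that the two forms can be compared without roundoff.

```python
from fractions import Fraction


def continued_fraction_eval(x, C, D):
    """Evaluate (7.2-16) by the recurrence (7.2-17).

    C = [C_0, ..., C_k] and D = [D_0, ..., D_k].  A ZeroDivisionError is
    raised in the degenerate case where some x + D_j + d_{j+1} vanishes.
    """
    k = len(C) - 1
    d = 0                               # d_{k+1} = 0
    for j in range(k, 0, -1):           # j = k, ..., 1
        d = C[j] / (x + D[j] + d)       # d_j = C_j / (x + D_j + d_{j+1})
    return C[0] * x + D[0] + d


# 2x + 3 - 4 / (x - 13/4 + (181/16) / (x + 9/4))
C = [2, -4, Fraction(181, 16)]
D = [3, Fraction(-13, 4), Fraction(9, 4)]


def rational(x):
    """The same function as a quotient of polynomials."""
    return (2 * x**3 + x**2 + x + 3) / (x**2 - x + 4)


x = Fraction(1)
print(continued_fraction_eval(x, C, D), rational(x))  # 7/4 7/4
```

The loop performs k = 2 divisions and the final line one multiplication, matching the operation count stated for m = k + 1.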


7.3 PADÉ APPROXIMATIONS


Our first approach toward generating approximations of the form (7.2-1)
will be, for a given m and k, to choose $P_m(x)$ and $Q_k(x)$ so that f(x) and
$P_m(x)/Q_k(x)$ are equal at x = 0 and have as many derivatives as possible
equal at x = 0. In the case k = 0, the approximation is then just the
Maclaurin expansion for f(x). Implicit in what follows will be the assumption
that the Maclaurin series for f(x) exists in some neighborhood of x = 0.
There are two reasons for the arbitrary choice of x = 0: (1) it makes the
manipulations below substantially simpler than for any other x, and (2)
the interval over which we wish to approximate most functions will contain
0, and when it does not, a simple change of variable can be used to make
the interval contain 0. We also assume that $P_m(x)$ and $Q_k(x)$ have no
common factors. Now let

$$P_m(x) = \sum_{j=0}^{m} a_j x^j \qquad Q_k(x) = \sum_{j=0}^{k} b_j x^j \qquad b_0 = 1 \tag{7.3-1}$$


It is permissible to let the constant term in $Q_k(x)$ equal 1 because (1) the
constant term cannot be 0 if the approximation is to exist at x = 0 and (2)
the value of $R_{m,k}(x)$ is unchanged if numerator and denominator are divided
by the same constant.

Now let f(x) have a Maclaurin series

$$f(x) = \sum_{j=0}^{\infty} c_j x^j \tag{7.3-2}$$

Then we consider the difference

$$f(x) - \frac{P_m(x)}{Q_k(x)} = \frac{\left(\sum_{j=0}^{\infty} c_j x^j\right)\left(\sum_{j=0}^{k} b_j x^j\right) - \sum_{j=0}^{m} a_j x^j}{\sum_{j=0}^{k} b_j x^j} \tag{7.3-3}$$

Since we have N + 1 constants (m + 1 $a_j$'s and k $b_j$'s) at our disposal, we
would hope to make $f(x) - R_{m,k}(x)$ and its first N derivatives equal to 0 at
x = 0. We shall achieve this if the numerator of the right-hand side of (7.3-3)
is such that its leading power is of degree N + 1 (why?). Thus we write

$$\left(\sum_{j=0}^{\infty} c_j x^j\right)\left(\sum_{j=0}^{k} b_j x^j\right) - \sum_{j=0}^{m} a_j x^j = \sum_{j=N+1}^{\infty} d_j x^j \tag{7.3-4}$$

The vanishing of the coefficients of the first N + 1 powers of x on the
left-hand side of (7.3-4) is equivalent to the equations {10}

$$\begin{aligned} \sum_{j=0}^{k} c_{N-s-j}\,b_j &= 0 & s &= 0, 1, \ldots, N-m-1 \\ & & c_i &= 0 \text{ if } i < 0, \quad b_0 = 1 \\ a_r &= \sum_{j=0}^{r} c_{r-j}\,b_j & r &= 0, 1, \ldots, m \end{aligned} \tag{7.3-5}$$

(in the last sum $b_j$ is taken to be 0 for j > k). When this set of N + 1 linear
equations in the N + 1 unknowns has a
solution, it provides us with the desired approximation of the form (7.2-1)
{10}. It should also be noted that Padé approximations for a given value of N
can be computed via recurrence relations from Padé approximations for
smaller values of N {11} and techniques also exist for computing all approximations
for a particular N from $R_{N,0}(x)$, also via recurrence relations.

One disadvantage of this derivation is that it does not provide us with
an error term in closed form. Such an error term can be derived using other 
techniques {14 to 19}. But since our emphasis here is on finding the maxi- 
mum error on an interval, error terms which contain a derivative of the 
function which we can bound or estimate but not evaluate are not of basic 
interest to us. Rather we must be able to find the error at any point in the 
interval, and this we shall do by actual evaluation of the approximation and 
comparison with the true function. 

Estimates of the error are useful, however, in indicating how good a
given approximation is likely to be in the minimum maximum error context.
For the case of approximations as we have derived them in this section, it is
often true that the coefficients $d_i$, i = N + 1, ..., and $b_i$, i = 1, 2, ..., k, in
(7.3-4) decrease very rapidly in magnitude. Thus a good estimate of the error
in (7.2-1) may be given by the first term on the right-hand side of (7.3-4),
which is $d_{N+1}x^{N+1}$, with $d_{N+1}$ given by

$$d_{N+1} = \sum_{j=0}^{k} c_{N+1-j}\,b_j \tag{7.3-6}$$

This approximation to the error illustrates the fact that, in common with 
Maclaurin-series approximations, rational approximations of the kind we 
have developed have errors which are small near 0 and increase away from 0. 
This is just what we would expect because of our requirement that the 
approximation agree with f(x) and its first N derivatives at x = 0. This 
means that in practice if x = 0 is not the center of the interval over which we 
are approximating, we should make a change of variable so that 0 becomes 
the center of the interval. 
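
To make the estimate (7.3-6) concrete, the sketch below computes $d_{N+1}$ for the (2, 2) Padé entry of $e^x$, using the coefficients $b_1 = -\frac12$, $b_2 = \frac{1}{12}$ that are derived in Sec. 7.4, and compares $d_{N+1}x^{N+1}$ with the actual error at two points. The variable and function names are illustrative assumptions, not notation from the text.

```python
from fractions import Fraction
from math import exp, factorial

N = 4
# Maclaurin coefficients of e^x: c_j = 1/j!
c = [Fraction(1, factorial(j)) for j in range(N + 2)]
# Denominator coefficients b_0, b_1, b_2 of the (2, 2) entry for e^x.
b = [Fraction(1), Fraction(-1, 2), Fraction(1, 12)]

# (7.3-6): d_{N+1} = sum_{j=0}^{k} c_{N+1-j} * b_j
d_next = sum(c[N + 1 - j] * b[j] for j in range(len(b)))
print(d_next)  # 1/720


def estimated_error(x):
    """First term on the right-hand side of (7.3-4)."""
    return float(d_next) * x ** (N + 1)


def actual_error(x):
    """e^x minus the (2, 2) Pade entry."""
    return exp(x) - (12 + 6 * x + x * x) / (12 - 6 * x + x * x)


for x in (-0.25, 0.25):
    print(x, estimated_error(x), actual_error(x))
```

In this example the estimate has the correct sign and order of magnitude at both points, which is all that is asked of it here.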

The development of approximations of the form (7.2-1) by the method 
of this section is due to the French mathematician Padé. The approximation 
$R_{m,k}(x)$ is called the (m, k) entry in the Padé table of f(x). We note the
important empirical fact that, for most functions for which approximations 
are desired on computers, among all the entries in the Padé table, those for 
m=k or m=k+1 give the smallest minimum maximum error for a 
given N. 


7.4 AN EXAMPLE 


Let us consider the problem of approximating $e^x$ on the interval $(-\infty, \infty)$.†
The first thing we must consider is the infinite interval. We cannot expect to
approximate a function over an infinite interval with good error behavior
over the whole interval. Our first step then will be to convert the problem to
that of approximating $e^x$ on some finite interval.

† We shall use approximations to the exponential function for illustrative purposes
throughout this chapter because they demonstrate the various approximation methods nicely
and are relatively easy to derive. But we should note that the unique property of the exponential
function that $e^{-x} = 1/e^x$ does in some cases lead to approximations which have this property
and which are therefore atypical.

We suppose that the approximation is going to be used on a digital
computer which is binary internally.
For any x in $(-\infty, \infty)$ we write

$$x \log_2 e = X + F \tag{7.4-1}$$

where X is an integer and F a fraction such that -1 < F < 1.† From (7.4-1)
we have

$$2^{x \log_2 e} = e^x = 2^X \times 2^F = 2^X \times e^{F \ln 2} \tag{7.4-2}$$


Since X is an integer, on a binary computer multiplication by $2^X$ represents a
shift of a fixed-point number or a change in the exponent of a floating-point
number, both of which are very simple operations. Thus we are left with the
problem of approximating the exponential over the interval $(-\ln 2, \ln 2) \approx$
(-.69, .69).
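
The reduction (7.4-1) and (7.4-2) can be sketched as follows. This is an illustration under stated assumptions: the function name `exp_reduced` is hypothetical, truncation toward zero is one way of obtaining an integer X with -1 < F < 1, and the approximation used on the reduced interval is the (2, 2) Padé entry (7.4-4) derived later in this section.

```python
import math


def exp_reduced(x):
    """Approximate e^x by range reduction followed by a (2, 2) Pade entry."""
    t = x * math.log2(math.e)      # x log2(e) = X + F
    X = math.trunc(t)              # integer part, so that -1 < F < 1
    F = t - X
    u = F * math.log(2)            # e^x = 2^X * e^(F ln 2), with |u| < ln 2
    r = (12 + 6 * u + u * u) / (12 - 6 * u + u * u)   # R_{2,2}(u)
    return math.ldexp(r, X)        # multiply by 2^X via the exponent field


for x in (-10.0, -1.3, 0.5, 3.7, 20.0):
    approx = exp_reduced(x)
    print(x, approx, abs(approx - math.exp(x)) / math.exp(x))
```

`math.ldexp` adjusts only the binary exponent, which mirrors the remark that multiplication by $2^X$ is a very simple operation on a binary machine.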

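Thus we have reduced our original problem to that of finding an
approximation to $e^x$ on the interval (-.69, .69). For illustrative purposes, we
consider the case N = 4. For m = k = 2 we shall now perform the calculations
of Sec. 7.3 in detail. From the Maclaurin expansion for $e^x$, we have
$c_0 = 1$, $c_1 = 1$, $c_2 = \frac12$, $c_3 = \frac16$, $c_4 = \frac{1}{24}$. With these values Eqs. (7.3-5) become

$$\begin{gathered} \tfrac{1}{24} + \tfrac16 b_1 + \tfrac12 b_2 = 0 \qquad \tfrac16 + \tfrac12 b_1 + b_2 = 0 \\ a_0 = 1 \qquad a_1 = 1 + b_1 \qquad a_2 = \tfrac12 + b_1 + b_2 \end{gathered} \tag{7.4-3}$$

Solving the first two equations for $b_1$ and $b_2$ and then using the last three to
calculate $a_0$, $a_1$, and $a_2$, we find

$$b_1 = -\tfrac12 \qquad b_2 = \tfrac{1}{12} \qquad a_0 = 1 \qquad a_1 = \tfrac12 \qquad a_2 = \tfrac{1}{12}$$

so that

$$R_{2,2}(x) = \frac{12 + 6x + x^2}{12 - 6x + x^2} \tag{7.4-4}$$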
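
The hand calculation above can be reproduced by solving Eqs. (7.3-5) mechanically. The Python sketch below is a minimal illustration, not a production routine: the names `solve_linear` and `pade` are hypothetical, exact fractions are used to avoid roundoff, and the elimination simply fails (with `StopIteration`) if the system has no unique solution.

```python
from fractions import Fraction
from math import factorial


def solve_linear(A, rhs):
    """Gauss-Jordan elimination in exact rational arithmetic."""
    n = len(A)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]


def pade(c, m, k):
    """Return (a, b) for the (m, k) Pade entry from Maclaurin coefficients c."""
    N = m + k

    def cc(i):
        return c[i] if i >= 0 else Fraction(0)   # c_i = 0 if i < 0

    # First part of (7.3-5) with b_0 = 1 moved to the right-hand side.
    A = [[cc(N - s - j) for j in range(1, k + 1)] for s in range(k)]
    rhs = [-cc(N - s) for s in range(k)]
    b = [Fraction(1)] + (solve_linear(A, rhs) if k else [])
    # Second part of (7.3-5); b_j is absent (zero) for j > k.
    a = [sum(cc(r - j) * b[j] for j in range(min(r, k) + 1)) for r in range(m + 1)]
    return a, b


c_exp = [Fraction(1, factorial(j)) for j in range(5)]
a, b = pade(c_exp, 2, 2)
print(a)  # [Fraction(1, 1), Fraction(1, 2), Fraction(1, 12)]
print(b)  # [Fraction(1, 1), Fraction(-1, 2), Fraction(1, 12)]
```

Multiplying numerator and denominator by 12 gives the form (7.4-4).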


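Similarly we can calculate the other entries of the Padé table for N = 4 {20}:

$$\begin{aligned} R_{4,0}(x) &= 1 + x + \tfrac12 x^2 + \tfrac16 x^3 + \tfrac{1}{24} x^4 \\ R_{3,1}(x) &= \frac{24 + 18x + 6x^2 + x^3}{24 - 6x} \\ R_{1,3}(x) &= \frac{24 + 6x}{24 - 18x + 6x^2 - x^3} \\ R_{0,4}(x) &= \frac{1}{1 - x + \tfrac12 x^2 - \tfrac16 x^3 + \tfrac{1}{24} x^4} \end{aligned} \tag{7.4-5}$$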


† Actually it would be easy to adjust X so that $-\frac12 < F < \frac12$. We do not do so in order to
keep the example in this section as consistent as possible with examples in subsequent sections
of this chapter.

Figure 7.1 Errors in Padé approximations to $e^x$.

The approximation $R_{m,0}(x)$ is always just the truncated Maclaurin-series
expansion of the function. Note also that in this example we have
$R_{m,k}(x) = 1/R_{k,m}(-x)$ (why would we expect this?).

Our interest is in the errors in these various approximations over the
interval $[-.69, .69]$. In Fig. 7.1 we have plotted $e^x - R_{m,k}(x)$, and in Table 7.1
we have listed some sample values of this error. We note first that over the
range $[-.69, .69]$ $R_{2,2}(x)$ has the minimum maximum error, consistent with
our previous assertion that approximations with $m = k$ or $k + 1$ would
generally be best. However, over the larger interval $[-1.0, 1.0]$ shown in Table
7.1 $R_{3,1}(x)$ is the best approximation in a minimum maximum error sense.
Note also that as the ends of the interval are approached, the errors grow
very rapidly. This suggests that where possible we shall approximate
functions over quite small intervals. Indeed our assertion about approximations
with $m = k$ or $k + 1$ was made with small intervals in mind.
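The comparison just described can be reproduced numerically. The sketch below is an illustration only: it samples each entry of (7.4-5) and (7.4-4) on a uniform grid (the grid size and the names `PADE`, `max_abs_error`, and `best_entry` are our own choices) and reports which entry has the smallest sampled maximum error on a symmetric interval.

```python
import math

# The five Pade table entries for e^x with N = 4, from (7.4-4) and (7.4-5).
PADE = {
    (4, 0): lambda x: 1 + x + x**2 / 2 + x**3 / 6 + x**4 / 24,
    (3, 1): lambda x: (24 + 18 * x + 6 * x**2 + x**3) / (24 - 6 * x),
    (2, 2): lambda x: (12 + 6 * x + x**2) / (12 - 6 * x + x**2),
    (1, 3): lambda x: (24 + 6 * x) / (24 - 18 * x + 6 * x**2 - x**3),
    (0, 4): lambda x: 1 / (1 - x + x**2 / 2 - x**3 / 6 + x**4 / 24),
}

def max_abs_error(approx, half_width, samples=2001):
    """Largest sampled |e^x - approx(x)| on [-half_width, half_width]."""
    step = 2 * half_width / (samples - 1)
    xs = [-half_width + step * i for i in range(samples)]
    return max(abs(math.exp(x) - approx(x)) for x in xs)

def best_entry(half_width):
    """(m, k) with the smallest sampled maximum error on the interval."""
    return min(PADE, key=lambda mk: max_abs_error(PADE[mk], half_width))

for hw in (0.69, 1.0):
    print(hw, best_entry(hw))
```

On the two intervals discussed in the text this selects $(2, 2)$ and $(3, 1)$ respectively, although on $[-1, 1]$ the margin between the two is small.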

In general we have two alternatives when an approximation for a given 
N is not sufficiently good over the whole interval: (1) break up the interval 
into two or more subintervals and use an appropriate approximation over 
each subinterval or (2) use a larger value of N to get more accuracy. The 
disadvantage of the latter method is that the greater the value of N, the more 
multiplications and divisions required. But if different approximations are 
used on different subintervals, this is wasteful of memory and the time 
required to find the proper subinterval for a given argument may grow with 
the number of subintervals unless they are chosen very carefully. General 
rules for choosing the value of N and the number of subintervals are quite 
difficult to give. 

With Padé approximations, the error increases rapidly away from the
center of the interval, but, as we shall see, for minimum maximum error
approximations, this is not the case. In the latter case, the argument for
subintervals, while it still exists, is not nearly so cogent as for Padé
approximations. Thus on digital computers most approximations are for the whole
interval. In this case the accuracy desired, i.e., the maximum error that can
be tolerated, determines $N$.


Table 7.1 Errors $e^x - R_{m,k}(x)$ in Padé approximations to $e^x$

                                       (m, k)
    x        (4, 0)        (3, 1)        (2, 2)        (1, 3)        (0, 4)
 -1.00     -.00712        .00121       -.00054        .00053       -.00135
  -.75     -.00175        .00033       -.00016        .00018       -.00050
  -.69     -.00117        .00022       -.00011        .00013       -.00037
  -.50     -.000240       .000049      -.000027       .000032      -.000104
  -.25     -.0000078      .0000018     -.0000011      .0000014     -.0000052
   .25      .0000086     -.0000023      .0000018     -.0000029      .0000129
   .50      .000284      -.000088       .000073      -.000134       .000653
   .69      .00147       -.00050        .00045       -.00088        .00463
   .75      .00225       -.00079        .00072       -.00147        .00783
  1.00      .00995       -.00394        .00400       -.00899        .05162

If we were using the approximation (7.4-4), we would, of course, not
compute it in that form but in one of the forms of Sec. 7.2. We note that
$R_{2,2}(x)$ requires two divisions to calculate as a continued fraction while all
the other approximations with $N = 4$ require at least three operations, i.e.,
multiplications and divisions. Because both numerator and denominator
polynomials have a coefficient of the highest power of 1, $R_{2,2}(x)$ requires
only two multiplications and one division if evaluated as the quotient of two
polynomials. In this form the other approximations require at least three
operations each, although $R_{4,0}(x)$ requires no division {22}.
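The two operation counts quoted for $R_{2,2}(x)$ can be made concrete. The sketch below assumes one particular continued-fraction arrangement, $1 + 12/[(x - 6) + 12/x]$, obtained by elementary algebra from (7.4-4); whether this is exactly the form intended in Sec. 7.2 is an assumption on our part. Note that this arrangement cannot be evaluated at $x = 0$.

```python
import math

def r22_quotient(x):
    """(12 + 6x + x^2)/(12 - 6x + x^2): two multiplications, one division."""
    return ((x + 6.0) * x + 12.0) / ((x - 6.0) * x + 12.0)

def r22_continued_fraction(x):
    """Same function as 1 + 12/((x - 6) + 12/x): two divisions, x != 0."""
    return 1.0 + 12.0 / ((x - 6.0) + 12.0 / x)

for x in (-0.69, -0.25, 0.25, 0.69):
    print(x, r22_quotient(x), r22_continued_fraction(x), math.exp(x))
```

Both forms agree to rounding error wherever the second is defined; the choice between them depends on the relative cost of division and multiplication on the machine at hand.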


7.5 CHEBYSHEV POLYNOMIALS 


We would not expect Padé approximations to be best or even nearly best in 
a minimax sense. The basis for their derivation—equality of a function and 
its derivatives at a point—as we have seen, does not give good error beha- 
vior over a whole interval. In the remainder of this chapter, we shall develop 
methods which lead to approximations which are better in the minimax 
sense than Padé approximations. In particular in Sec. 7.7 we shall use Padé 
approximations as the starting point from which to develop better approxi- 
mations. As a fundamental tool in much of the rest of this chapter, we shall 
use the Chebyshev polynomials, mentioned briefly in Chap. 4, and which we 
now discuss in some detail. 

Before so doing, however, let us consider the motivation for the use of 
Chebyshev polynomials. The problem with using approximations based on 
Maclaurin series is that the error over an interval centered at 0 is extremely
nonuniform: small near the center but growing very rapidly near the
endpoints. It would seem more reasonable to use as approximating functions,
instead of powers of x, polynomials whose behavior over an interval 
centered at 0 would be in some sense uniform. We would hope that rational 
functions formed from combinations of these polynomials would exhibit a 
more uniform error behavior. As we shall now show, the Chebyshev polyno- 
mials have ideal properties for these aims. 

The Chebyshev polynomial of degree $r$ is given by (see Prob. 26,
Chap. 4)

$$T_r(x) = \cos (r \cos^{-1} x) \tag{7.5-1}$$

These polynomials satisfy the orthogonality relationship

$$\int_{-1}^{1} \frac{T_r(x) T_s(x)}{\sqrt{1 - x^2}}\, dx = \begin{cases} 0 & r \ne s \\ \pi & r = s = 0 \\ \pi/2 & r = s \ne 0 \end{cases} \tag{7.5-2}$$

and the recurrence relationship (see Prob. 26, Chap. 4)

$$T_{r+1}(x) = 2x T_r(x) - T_{r-1}(x) \qquad T_0(x) = 1 \qquad T_1(x) = x \tag{7.5-3}$$


In this and the next two sections we shall assume that the interval over
which we wish to approximate a function is $[-1, 1]$. This will be convenient
in the development here and involves no loss of generality.
From (7.5-1), it follows that in $[-1, 1]$ $T_r(x)$ has $r$ zeros at

$$x = \cos \frac{(2j + 1)\pi}{2r} \qquad j = 0, \ldots, r - 1 \tag{7.5-4}$$

and $r + 1$ extrema of magnitude 1 at

$$x = \cos \frac{j\pi}{r} \qquad j = 0, \ldots, r \tag{7.5-5}$$

where (7.5-5) includes the $r - 1$ places where $T_r'(x) = 0$ as well as the two
endpoints. For T,(x) this is illustrated in Fig. 7.2. 
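The definition (7.5-1), the recurrence (7.5-3), and the locations (7.5-4) and (7.5-5) are easy to check against one another numerically. The following is a minimal sketch; the function names are ours.

```python
import math

def chebyshev_T(r, x):
    """T_r(x) generated by the three-term recurrence (7.5-3)."""
    t_prev, t = 1.0, x
    if r == 0:
        return t_prev
    for _ in range(r - 1):
        t_prev, t = t, 2.0 * x * t - t_prev
    return t

def chebyshev_zeros(r):
    """The r zeros of T_r in [-1, 1], from (7.5-4)."""
    return [math.cos((2 * j + 1) * math.pi / (2 * r)) for j in range(r)]

def chebyshev_extrema(r):
    """The r + 1 extrema of T_r in [-1, 1], from (7.5-5)."""
    return [math.cos(j * math.pi / r) for j in range(r + 1)]

r = 5
print([round(chebyshev_T(r, z), 12) for z in chebyshev_zeros(r)])
print([round(chebyshev_T(r, e), 12) for e in chebyshev_extrema(r)])
```

The second line of output alternates between values of magnitude 1, which is the equal-ripple behavior used in the theorem that follows.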

That property of the Chebyshev polynomials of particular interest to us 
here is expressed by the following theorem. 


Figure 7.2 Graph of T,(x). 
Theorem 7.1 (Chebyshev) Of all polynomials of degree $r$ with coefficient
of $x^r$ equal to 1, the Chebyshev polynomial of degree $r$ multiplied by
$1/2^{r-1}$ oscillates with minimum maximum amplitude on the interval
$[-1, 1]$.

PROOF From (7.5-3) it follows that the coefficient of $x^r$ in $T_r(x)$ is $2^{r-1}$.
Thus when $r > 0$,

$$Q_r(x) = \frac{1}{2^{r-1}} T_r(x) \tag{7.5-6}$$

satisfies the conditions of the theorem. The requirement that the
coefficient of $x^r$ be 1 has the effect of normalizing all polynomials of
degree $r$. Without this normalization the theorem would be meaningless.
The proof is by contradiction. Suppose there exists a polynomial
$q_r(x)$ of degree $r$ with leading coefficient 1 which has a smaller
maximum amplitude than $Q_r(x)$ on $[-1, 1]$. We consider

$$p_{r-1}(x) = Q_r(x) - q_r(x) \tag{7.5-7}$$

which is a polynomial of degree at most $r - 1$. The $r + 1$ extrema of the
polynomial $Q_r(x)$, each of magnitude $1/2^{r-1}$, are given by (7.5-5). By our
hypothesis $q_r(x)$ has a smaller magnitude than $Q_r(x)$ at each of these extrema, so
that $p_{r-1}(x)$ has the same sign as $Q_r(x)$ at each of these extrema. From
the definition of $T_r(x)$, it follows that the $r + 1$ extrema of $T_r(x)$ and thus
of $Q_r(x)$ alternate in sign. It follows then that $p_{r-1}(x)$ alternates in sign
from one extremum of $Q_r(x)$ to the next, which means that $p_{r-1}(x)$ has $r$
zeros in $[-1, 1]$. But $p_{r-1}(x)$ is a polynomial of degree at most $r - 1$. Therefore,
we have a contradiction. Now suppose there exists a polynomial $q_r(x)$
with maximum amplitude equal to that of $Q_r(x)$. Unless $q_r(x) = Q_r(x)$
at an extremum of $Q_r(x)$, we get a contradiction as above. But, if
$q_r(x) = Q_r(x)$ at such an extremum, then $p_{r-1}(x)$ has a double zero at this
extremum, and, proceeding as above to count the zeros of $p_{r-1}(x)$, we
again arrive at a contradiction, which completes the proof of the
theorem.
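The content of Theorem 7.1 can be illustrated, though of course not proved, by a small experiment. By the argument in the proof, no polynomial with leading coefficient 1 can have magnitude below $1/2^{r-1}$ at all $r + 1$ points of (7.5-5), so it suffices to look at those points. The sketch below draws random lower-order coefficients; the names, the seed, and the coefficient range are arbitrary choices of ours.

```python
import math
import random

def monic_values_at_extrema(lower, r):
    """Values of x^r + sum_j lower[j]*x^j at the r + 1 points cos(j*pi/r)."""
    pts = [math.cos(j * math.pi / r) for j in range(r + 1)]
    return [x**r + sum(c * x**j for j, c in enumerate(lower)) for x in pts]

def smallest_peak(r=5, trials=200, seed=0):
    """Return (2^(1-r), smallest peak magnitude seen over random monic trials)."""
    rng = random.Random(seed)
    bound = 2.0 ** (1 - r)
    peaks = []
    for _ in range(trials):
        lower = [rng.uniform(-1.0, 1.0) for _ in range(r)]
        peaks.append(max(abs(v) for v in monic_values_at_extrema(lower, r)))
    return bound, min(peaks)

bound, worst = smallest_peak()
print(bound, worst)
# Q_5(x) = T_5(x)/16 = x^5 - (20/16)x^3 + (5/16)x attains the bound exactly.
print(monic_values_at_extrema([0.0, 5 / 16, 0.0, -20 / 16, 0.0], 5))
```

Every random trial yields a peak no smaller than the bound, while $Q_5(x)$ alternates between $\pm 1/16$.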


The Chebyshev polynomials are sometimes called equal-ripple polyno- 
mials because they oscillate between positive and negative extrema of the 
same magnitude. In the next two sections, we shall use the Chebyshev poly- 
nomials to derive approximations which are superior to Padé approxima- 
tions in the minimax sense. 


7.6 CHEBYSHEV EXPANSIONS 


The expansion of $f(x)$ in a series of Chebyshev polynomials is given by

$$f(x) = \tfrac12 c_0 + \sum_{j=1}^{\infty} c_j T_j(x) \tag{7.6-1}$$

where, using the orthogonality property of the Chebyshev polynomials, we
can write the coefficients in (7.6-1) as

$$c_j = \frac{2}{\pi} \int_{-1}^{1} \frac{f(x) T_j(x)}{\sqrt{1 - x^2}}\, dx \qquad j = 0, 1, \ldots \tag{7.6-2}$$

The series in (7.6-1) converges uniformly whenever $f(x)$ is continuous and of
bounded variation in $[-1, 1]$. Our first object is to approximate $f(x)$ by
truncating the expansion (7.6-1) after a finite number of terms. We shall call
the approximation formed by truncating at $j = m$, $T_{m,0}(x)$. Having
calculated $T_{m,0}(x)$ using (7.6-1), we shall then convert it to polynomial form by
writing each $T_j(x)$ in its polynomial form.


Example 7.2 Let $f(x) = e^x$. Use (7.6-1) to calculate $T_{4,0}(x)$ and compare this with $R_{4,0}(x)$
in (7.4-5).

With $f(x) = e^x$, the integrals in (7.6-2) can be evaluated to give [see Watson (1962,
p. 20, eq. 5)]

$$c_j = 2 I_j(1) \tag{7.6-3}$$

where $I_j(x)$ is the modified Bessel function of the first kind. Using this result and
truncating (7.6-1) after five terms, we have

$$T_{4,0}(x) = 1.266066 + 1.130318\,T_1(x) + .271495\,T_2(x) + .044337\,T_3(x) + .005474\,T_4(x) \tag{7.6-4}$$

which, converted to a polynomial, becomes

$$T_{4,0}(x) = 1.000044 + .997310x + .499200x^2 + .177344x^3 + .043792x^4 \tag{7.6-5}$$


However, it should be noted that for maximum accuracy $T_{4,0}(x)$ should be evaluated for
particular values of $x$ using (7.6-4) with the values of the Chebyshev polynomials
calculated using the recurrence relation (7.5-3). Indeed, the evaluation of any $T_{m,k}(x)$ should
use the Chebyshev polynomial rather than powers-of-$x$ form. In Fig. 7.3 and Table 7.2 the
errors in this approximation and $R_{4,0}(x)$ are compared on the interval $[-1, 1]$.† From the
figure and table the improved accuracy of $T_{4,0}(x)$ is clear. As we might expect, this is
particularly notable near the endpoints, thus bearing out our hope that the smoothness of
the Chebyshev polynomials would result in improved error behavior near the endpoints.
For later reference (see Sec. 7.8) we note that the error has six maxima and minima,
including the endpoints. Starting from $-1$, the values of these extrema are (multiplied by
$10^4$) $-5.03$, $5.08$, $-5.28$, $5.56$, $-5.82$, $5.92$.


Table 7.2 Errors in various polynomial approximations to $e^x$

    x       e^x - R_{4,0}(x)    e^x - T_{4,0}(x)    e^x - C_{4,0}(x)
 -1.00        -.00712             -.00050              .00069
  -.75        -.00175              .00047              .00069
  -.50        -.000240            -.00023             -.00024
  -.25        -.0000078           -.00052             -.00050
   0           .0000000           -.00004              .00000
   .25         .0000085            .00051              .00050
   .50         .000284             .00032              .00028
   .75         .00225             -.00050             -.00019
  1.00         .00994              .00059              .00214


Figure 7.3 Errors in various polynomial approximations to $e^x$. [Curves plotted: $e^x - R_{4,0}(x)$, $e^x - T_{4,0}(x)$, $e^x - C_{4,0}(x)$, and $e^x - R^*_{4,0}(x)$; plot annotation: $\max_{[-1,1]} |e^x - R^*_{4,0}(x)| \approx 5.47 \times 10^{-4}$.]


† We could compare over the interval $[-.69, .69]$ by changing variable from $x \to x/.69$ in
(7.6-5).
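The recommendation to evaluate (7.6-4) through the recurrence (7.5-3), rather than through the power form (7.6-5), can be sketched as follows. The coefficients are the rounded values printed in (7.6-4); the helper name `cheb_series` is our own, and a forward recurrence is used here for simplicity.

```python
import math

# Coefficients of (7.6-4): the T_0 term is c_0/2, followed by c_1, ..., c_4.
T40_COEFFS = [1.266066, 1.130318, 0.271495, 0.044337, 0.005474]

def cheb_series(coeffs, x):
    """Sum of coeffs[j]*T_j(x), with T_j generated by the recurrence (7.5-3)."""
    t_prev, t = 1.0, x          # T_0(x), T_1(x)
    total = coeffs[0]
    for c in coeffs[1:]:
        total += c * t
        t_prev, t = t, 2.0 * x * t - t_prev
    return total

for x in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(x, math.exp(x) - cheb_series(T40_COEFFS, x))
```

The printed errors agree with the $e^x - T_{4,0}(x)$ column of Table 7.2 to within the rounding of the tabulated coefficients.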
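We shall now use the expansion (7.6-1) to generate rational approximations
in a manner analogous to that used to generate Padé approximations
from Maclaurin expansions. We wish to find approximations of the form

$$T_{m,k}(x) = \frac{\sum_{j=0}^{m} a_j T_j(x)}{\sum_{j=0}^{k} b_j T_j(x)} \tag{7.6-6}$$

where the $a_j$'s and $b_j$'s are to be determined so that in

$$f(x) - T_{m,k}(x) = \frac{\left[\tfrac12 c_0 + \sum_{j=1}^{\infty} c_j T_j(x)\right]\left[\sum_{j=0}^{k} b_j T_j(x)\right] - \sum_{j=0}^{m} a_j T_j(x)}{\sum_{j=0}^{k} b_j T_j(x)} \tag{7.6-7}$$

the coefficients of $T_j(x)$, $j = 0, \ldots, N$, in the numerator of the right-hand side
vanish [cf. (7.3-3) and (7.3-4)]. Thus we write

$$\left[\tfrac12 c_0 + \sum_{j=1}^{\infty} c_j T_j(x)\right]\left[\sum_{j=0}^{k} b_j T_j(x)\right] - \sum_{j=0}^{m} a_j T_j(x) = \sum_{j=N+1}^{\infty} h_j T_j(x) \tag{7.6-8}$$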

In order to get equations for the $a_j$'s and $b_j$'s, we use the identity {25}

$$T_{i+j}(x) + T_{|i-j|}(x) = 2 T_i(x) T_j(x) \tag{7.6-9}$$

and with this rewrite (7.6-8) as

$$\tfrac12 c_0 \sum_{j=0}^{k} b_j T_j(x) + \tfrac12 \sum_{j=1}^{\infty} \sum_{i=0}^{k} b_i c_j \left[T_{i+j}(x) + T_{|i-j|}(x)\right] - \sum_{j=0}^{m} a_j T_j(x) = \sum_{j=N+1}^{\infty} h_j T_j(x) \tag{7.6-10}$$

From (7.6-10) we get the following set of equations for the $a_j$'s and $b_j$'s {25}:

$$a_0 = \tfrac12 \sum_{i=0}^{k} b_i c_i$$
$$a_r = \tfrac12 \sum_{i=0}^{k} b_i \left(c_{|r-i|} + c_{r+i}\right) \qquad r = 1, \ldots, N \tag{7.6-11}$$
$$a_r = 0 \qquad r > m$$

These are $N + 1$ equations in the $N + 2$ constants $a_j$ and $b_j$. This is what we
would expect since one coefficient in (7.6-6) is arbitrary. We shall set $b_0 = 1$
in all cases where this leads to a system (7.6-11) which is solvable.

The first nonzero coefficient on the right-hand side of (7.6-8) is given by

$$h_{N+1} = \tfrac12 \sum_{i=0}^{k} b_i \left(c_{N+1-i} + c_{N+1+i}\right) \tag{7.6-12}$$

In analogy with (7.3-6), this coefficient multiplied by $T_{N+1}(x)$ is often a good
approximation of the error in $T_{m,k}(x)$.


Table 7.3 Errors in various rational approximations to $e^x$

    x       e^x - R_{2,2}(x)    e^x - T_{2,2}(x)    e^x - C_{2,2}(x)
 -1.00        -.00054             -.000042            -.000032
  -.75        -.00016              .000047             .000036
  -.50        -.000027            -.000018            -.000027
  -.25        -.0000011           -.000070            -.000063
   0           .0000000           -.000020             .000000
   .25         .0000018            .000088             .000105
   .50         .000073             .000084             .000073
   .75         .00072             -.000122            -.000163
  1.00         .00400              .000189             .000239


Example 7.3 Let $f(x) = e^x$. Find $T_{2,2}(x)$ and compare with $R_{2,2}(x)$ as given by (7.4-4).
Equations (7.6-11) are

$$a_0 = \tfrac12 (c_0 + b_1 c_1 + b_2 c_2) \qquad (b_0 = 1)$$
$$a_1 = c_1 + \tfrac12 [b_1 (c_0 + c_2) + b_2 (c_1 + c_3)] \qquad a_2 = c_2 + \tfrac12 [b_1 (c_1 + c_3) + b_2 (c_0 + c_4)]$$
$$0 = c_3 + \tfrac12 [b_1 (c_2 + c_4) + b_2 (c_1 + c_5)] \qquad 0 = c_4 + \tfrac12 [b_1 (c_3 + c_5) + b_2 (c_2 + c_6)]$$

where for $j = 0, \ldots, 4$ the $c_j$'s are given by (7.6-4) and $c_5 = .000543$, $c_6 = .000045$. Solving
the last two equations for $b_1$ and $b_2$ and then solving the first three for $a_0$, $a_1$, and $a_2$, we
get

$$a_0 = 1.0009875 \qquad a_1 = .4825306 \qquad a_2 = .0397096$$
$$b_1 = -.4783387 \qquad b_2 = .0387418$$

Converting $T_{2,2}(x)$ to a rational function, we get†

$$T_{2,2}(x) = \frac{1.0000205 + .5019781x + .0826200x^2}{1.0 - .4976173x + .0806064x^2} \tag{7.6-13}$$

In Fig. 7.4 and Table 7.3, the errors in this approximation are compared with those
of $R_{2,2}(x)$ over $[-1, 1]$. Again we note that the Chebyshev approximation is substantially
better than the Padé approximation, particularly near the ends of the interval. Note,
however, that $T_{2,2}(x)$, while it also has six maxima and minima, including the endpoints, is
not nearly so smooth as $T_{4,0}(x)$ over the whole interval. Since no condition at $x = 0$ is
imposed in deriving $T_{m,k}(x)$, it is not surprising that neither $T_{4,0}(x)$ nor $T_{2,2}(x)$ gives exact
results at $x = 0$.


† Here and hereafter in this chapter, we shall normalize our approximations by setting
$b_0 = 1$ in order to conform to the notation of (7.3-1). However, in actual computational
practice, we would always normalize the approximation so that the coefficient of some, usually the
highest, power of $x$ in the denominator (or numerator) is equal to 1, for so doing always saves at
least one multiplication (cf. Sec. 7.2).
Figure 7.4 Errors in various rational approximations to $e^x$. [Plot annotation: $\max_{[-1,1]} |e^x - R^*_{2,2}(x)| \approx 8.69 \times 10^{-5}$.]
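The calculation of Example 7.3 can be retraced mechanically. The sketch below starts from the rounded coefficients quoted in the text (with $c_0$ taken as twice the leading term of (7.6-4)), solves the two homogeneous rows of (7.6-11) by Cramer's rule, and then evaluates the first three rows. The function name and the use of Cramer's rule are our own choices, and the results can be expected to match the printed values only to about the accuracy of the rounded input.

```python
# c_0 = 2 * 1.266066; c_1..c_4 from (7.6-4); c_5, c_6 as quoted in Example 7.3.
C = [2 * 1.266066, 1.130318, 0.271495, 0.044337, 0.005474, 0.000543, 0.000045]

def chebyshev_rational_22(c):
    """Coefficients (a, b) of T_{2,2} from (7.6-11) with b_0 = 1."""
    # Rows r = 3, 4 of (7.6-11), where a_3 = a_4 = 0:
    #   0 = c_r + (1/2)*[b1*(c_{r-1} + c_{r+1}) + b2*(c_{r-2} + c_{r+2})]
    a11, a12 = (c[2] + c[4]) / 2, (c[1] + c[5]) / 2
    a21, a22 = (c[3] + c[5]) / 2, (c[2] + c[6]) / 2
    det = a11 * a22 - a12 * a21
    b1 = (-c[3] * a22 + a12 * c[4]) / det
    b2 = (-a11 * c[4] + a21 * c[3]) / det
    b = [1.0, b1, b2]
    # Rows r = 0, 1, 2 of (7.6-11).
    a0 = 0.5 * sum(b[i] * c[i] for i in range(3))
    a1 = 0.5 * sum(b[i] * (c[abs(1 - i)] + c[1 + i]) for i in range(3))
    a2 = 0.5 * sum(b[i] * (c[abs(2 - i)] + c[2 + i]) for i in range(3))
    return [a0, a1, a2], b

a, b = chebyshev_rational_22(C)
print("a =", a)
print("b =", b)
```

Writing $T_1(x) = x$ and $T_2(x) = 2x^2 - 1$ and dividing through by the constant term of the denominator then gives the power form (7.6-13).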


Our examples have illustrated the general empirical result that rational 
approximations derived using the Chebyshev expansion of a function give a 
smaller maximum error than Padé approximations. The chief drawback to 
the approach as we have presented it thus far is the necessity of evaluating 
the integrals (7.6-2), which cannot be done analytically in most cases. One 
way to avoid evaluation of the integrals (7.6-2) is to use trigonometric
interpolation to approximate the first $L + 1$ coefficients in (7.6-1). We make the
change of variable $x = \cos\theta$, and remembering that $T_j(\cos\theta) = \cos j\theta$, we
see that (7.6-1) becomes

$$g(\theta) = f(\cos\theta) = \tfrac12 c_0 + \sum_{j=1}^{\infty} c_j \cos j\theta \tag{7.6-14}$$

and (7.6-2) becomes

$$c_j = \frac{2}{\pi} \int_0^{\pi} g(\theta) \cos j\theta \, d\theta \tag{7.6-15}$$

Equation (7.6-14) is just the Fourier-series expansion of $g(\theta)$. Therefore, the
Fourier and Chebyshev expansions of a function are related to each other
through the change of variable $x = \cos\theta$.

Our interest here is in approximating $g(\theta)$ by

$$y_L(\theta) = \tfrac12 \bar{c}_0 + \sum_{j=1}^{L} \bar{c}_j \cos j\theta \tag{7.6-16}$$

which is precisely equivalent to (6.6-36) since $b_j = 0$ for all $j$ when the
function being approximated is even. Therefore, applying the techniques of
Sec. 6.6-2, we get approximations $\bar{c}_0, \bar{c}_1, \ldots, \bar{c}_L$ to the true coefficients
$c_0, c_1, \ldots, c_L$.
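One simple way to obtain such approximate coefficients is to replace the integral (7.6-15) by an equally weighted cosine sum over equally spaced values of $\theta$. The sketch below uses the midpoint rule with a node count chosen arbitrarily by us; it is meant only as an illustration and is not necessarily the procedure of Sec. 6.6-2.

```python
import math

def chebyshev_coefficients(f, count, nodes=64):
    """Approximate c_0, ..., c_{count-1} of (7.6-15) by a midpoint cosine sum."""
    thetas = [(i + 0.5) * math.pi / nodes for i in range(nodes)]
    g = [f(math.cos(t)) for t in thetas]          # g(theta) = f(cos theta)
    return [2.0 / nodes * sum(gv * math.cos(j * t) for gv, t in zip(g, thetas))
            for j in range(count)]

c = chebyshev_coefficients(math.exp, 7)
print([round(v, 6) for v in c])
```

For $f(x) = e^x$ the result reproduces the coefficients of (7.6-4) (recall that the leading term there is $\tfrac12 c_0$) as well as $c_5$ and $c_6$ of Example 7.3, without any reference to Bessel functions.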

Another approach which avoids evaluating the integrals (7.6-2), which is 
easy to mechanize, and which can achieve as much accuracy as desired is to 
use the Maclaurin expansion of f(x) to calculate the coefficients in the 
Chebyshev expansion. This can be done by substituting the Maclaurin series 
into (7.6-2) and integrating term by term {27}. The resulting infinite series for
$c_j$ usually converges sufficiently rapidly to make the calculation of the $c_j$'s
quite feasible. In fact, this procedure can be mechanized on a digital com- 
puter so that the computer in effect generates the approximations to be used 
by the computer. This procedure does, however, involve a great deal of 
calculation even in the polynomial case (k = 0), and when we try to extend 
the method to rational functions, the amount of calculation becomes very 
great. In the next section we present another method of improving on Padé 
approximations using Chebyshev polynomials; this leads to approximations 
which are generally almost as good as those of this section and much easier 
to generate. 


7.7 ECONOMIZATION OF RATIONAL FUNCTIONS 


Our object here is to take the Padé approximations of Sec. 7.3 and perturb 
them so that the resulting approximation has a smaller minimum maximum 
error on the desired interval. Without losing generality, we shall assume that 
the interval is of the form $[-\alpha, \alpha]$ since a change of variable can always be
used to bring any other interval into this form. We shall first consider the 
problem for k = 0, that is, for polynomial approximations, and then shall 
extend it to general rational functions. 
7.7-1 Economization of Power Series


Given $N$, we begin with the Padé approximation with $m = N$, $k = 0$, which is
just the first $N + 1$ terms of the Maclaurin expansion of $f(x)$. We write this
as

$$R_{N,0}(x) = \sum_{j=0}^{N} d_j x^j \tag{7.7-1}$$

In order to get an improvement over this approximation, we shall use

$$R_{N+1,0}(x) = \sum_{j=0}^{N+1} d_j x^j \tag{7.7-2}$$

Then

$$C_{N,0}(x) = R_{N+1,0}(x) - d_{N+1} \frac{\alpha^{N+1}}{2^N} T_{N+1}\!\left(\frac{x}{\alpha}\right) \tag{7.7-3}$$

is a polynomial of degree $N$ since the leading term of the Chebyshev
polynomial $T_{N+1}(x/\alpha)$ is $2^N x^{N+1}/\alpha^{N+1}$. Moreover, since $|T_{N+1}(x)| \le 1$ for
$x \in [-1, 1]$, the error in $C_{N,0}(x)$ is greater than that in $R_{N+1,0}(x)$ by no
more than $(d_{N+1} \alpha^{N+1})/2^N$. Since $\alpha$ will be less than or equal to 1 in virtually
all applications and $d_{N+1}$ will be decreasing as $N$ increases, this added error
will be very small in general. What we have done then is to take the
power-series approximation of degree $N + 1$ and from it derive an approximation
of degree $N$ whose maximum error is very little greater than that of the
$(N + 1)$st degree approximation. Thus we have "economized" the power
series in the sense of using fewer terms to achieve almost the same result.

Our object here, though, is to compare $C_{N,0}(x)$ with $R_{N,0}(x)$; that is, we
wish to compare corresponding approximations of the same degree. We let
$x = \alpha u$, so that the interval for $u$ is $[-1, 1]$. We have from (7.7-1)

$$\lim_{\alpha \to 0} \frac{f(\alpha u) - R_{N,0}(\alpha u)}{\alpha^{N+1}} = d_{N+1} u^{N+1} \tag{7.7-4}$$

From (7.7-2) and (7.7-3)

$$\lim_{\alpha \to 0} \frac{f(\alpha u) - C_{N,0}(\alpha u)}{\alpha^{N+1}} = \lim_{\alpha \to 0} \frac{d_{N+1} (\alpha^{N+1}/2^N) T_{N+1}(u) + \sum_{j=N+2}^{\infty} d_j (\alpha u)^j}{\alpha^{N+1}} = d_{N+1} \frac{T_{N+1}(u)}{2^N} \tag{7.7-5}$$

Comparing (7.7-4) and (7.7-5) and using Theorem 7.1, we conclude that, in
the limit as $\alpha \to 0$, $C_{N,0}(x)$ has a smaller maximum error on $[-\alpha, \alpha]$ than
$R_{N,0}(x)$. Therefore, for sufficiently small $\alpha$, the economized approximation
$C_{N,0}(x)$ is a better approximation than $R_{N,0}(x)$ in the minimax sense. In
practice "sufficiently small $\alpha$" includes almost all intervals of interest.
Example 7.4 Let $f(x) = e^x$. With $\alpha = 1$ find $C_{4,0}(x)$ and compare with $R_{4,0}(x)$ and
$T_{4,0}(x)$.
We have

$$R_{5,0}(x) = 1 + x + \frac{x^2}{2} + \frac{x^3}{6} + \frac{x^4}{24} + \frac{x^5}{120} \qquad T_5(x) = 16x^5 - 20x^3 + 5x$$

so that (7.7-3) becomes

$$C_{4,0}(x) = R_{5,0}(x) - \tfrac{1}{120}\left(\tfrac{1}{16}\right)(16x^5 - 20x^3 + 5x) = 1 + \tfrac{383}{384}x + \frac{x^2}{2} + \tfrac{17}{96}x^3 + \frac{x^4}{24}$$

In Fig. 7.3 and Table 7.2, the errors in this approximation are compared with those of
$R_{4,0}(x)$ and $T_{4,0}(x)$. We note that (1) $C_{4,0}(x)$ is a much better approximation than
$R_{4,0}(x)$ and (2) near $x = 0$, $C_{4,0}(x)$ and $T_{4,0}(x)$ are about equally good, but near the
endpoints, $T_{4,0}(x)$ is substantially better. But as $\alpha \to 0$, the approximation using the
Chebyshev expansion and that using the economized power series would be more nearly
equivalent over the whole interval.
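The subtraction in (7.7-3) is simple enough to carry out in exact rational arithmetic. The following sketch does so for Example 7.4; the function name is ours, and $N = 4$, $\alpha = 1$ are fixed as in the example.

```python
from fractions import Fraction
from math import factorial

def economized_exp_40():
    """Coefficients of C_{4,0}(x) for e^x with alpha = 1, lowest power first."""
    # R_{5,0}: Maclaurin coefficients d_j = 1/j!, j = 0..5.
    r50 = [Fraction(1, factorial(j)) for j in range(6)]
    # T_5(x) = 16x^5 - 20x^3 + 5x, lowest power first.
    t5 = [0, 5, 0, -20, 0, 16]
    # d_{N+1} * alpha^(N+1) / 2^N with N = 4, alpha = 1.
    scale = r50[5] / 2**4
    c = [rj - scale * tj for rj, tj in zip(r50, t5)]
    assert c[5] == 0  # the x^5 terms cancel, leaving degree 4
    return c[:5]

print(economized_exp_40())
```

The result has the coefficients $1$, $\tfrac{383}{384}$, $\tfrac12$, $\tfrac{17}{96}$, $\tfrac1{24}$, in agreement with the polynomial displayed above.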


7.7-2 Generalization to Rational Functions 


Corresponding to (7.7-4), we have for any Padé approximation
$R_{m,k}(x) = P_m(x)/Q_k(x)$

$$\lim_{\alpha \to 0} \frac{f(\alpha u) - R_{m,k}(\alpha u)}{\alpha^{N+1}} = d^{(m,k)}_{N+1} u^{N+1} \tag{7.7-6}$$

The superscripts have been added to $d^{(m,k)}_{N+1}$ for later reference. Our object is,
in analogy with (7.7-5), to find a rational approximation $C_{m,k}(x)$ such that

$$\lim_{\alpha \to 0} \frac{f(\alpha u) - C_{m,k}(\alpha u)}{\alpha^{N+1}} = d^{(m,k)}_{N+1} \frac{T_{N+1}(u)}{2^N} \tag{7.7-7}$$

because if we can do so, we shall again have a better approximation than
$R_{m,k}(x)$ for sufficiently small $\alpha$.
To find such a $C_{m,k}(x)$, we use a sequence of Padé approximations to $f(x)$
of the form

$$R^{(j)}_{i,j-i}(x) = \frac{P^{(j)}_i(x)}{Q^{(j)}_{j-i}(x)} \qquad j = 0, \ldots, N - 1 \tag{7.7-8}$$

where the only restriction on $i$ is that $0 \le i \le j$. Therefore, $i$ is not uniquely
determined except when $m = 0$, $k = 0$ and, for all $m$ and $k$, when $j = 0$.
Analogous to (7.7-6), we have

$$\lim_{\alpha \to 0} \frac{f(\alpha u) - R^{(j)}_{i,j-i}(\alpha u)}{\alpha^{j+1}} = d^{(i,j-i)}_{j+1} u^{j+1} \tag{7.7-9}$$
Now we define

$$C_{m,k}(x) = \frac{P_m(x) + \sum_{j=0}^{N-1} \gamma_{j+1} P^{(j)}_i(x) + \gamma_0}{Q_k(x) + \sum_{j=0}^{N-1} \gamma_{j+1} Q^{(j)}_{j-i}(x)} \tag{7.7-10}$$

with

$$\gamma_{j+1} = \frac{d^{(m,k)}_{N+1}}{d^{(i,j-i)}_{j+1}} \frac{\alpha^{N-j}}{2^N}\, t_{j+1} \qquad j = 0, \ldots, N - 1$$
$$\gamma_0 = -\frac{d^{(m,k)}_{N+1} \alpha^{N+1} t_0}{2^N} \tag{7.7-11}$$

where $t_j$ is the coefficient of $u^j$ in $T_{N+1}(u)$; then although we leave the algebra
to a problem {31}, it can be shown that $C_{m,k}(x)$ as defined in (7.7-10) satisfies
(7.7-7).

From (7.7-11) it can be seen that this procedure will fail if $d^{(i,j-i)}_{j+1} = 0$ for
any $j$. When this happens, however, it will often be possible to choose
another member of the sequence (7.7-8). Note also that the foregoing
derivation requires the use of only those $\gamma_{j+1}$ for which $t_{j+1}$ is nonzero. Thus, when
$t_{j+1} = 0$ [as it is for every other coefficient of $T_{N+1}(u)$], the value of $d^{(i,j-i)}_{j+1}$ is
immaterial {31}.


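Example 7.5 Let $f(x) = e^x$. With $\alpha = 1$ find $C_{2,2}(x)$ and compare with $R_{2,2}(x)$ and
$T_{2,2}(x)$.

For the sequence (7.7-8), we choose $R^{(0)}_{0,0}(x)$, $R^{(1)}_{1,0}(x)$, $R^{(2)}_{1,1}(x)$, and $R^{(3)}_{2,1}(x)$.
Performing the calculations analogous to those in Sec. 7.4, we find

$$R^{(0)}_{0,0}(x) = 1 \qquad R^{(1)}_{1,0}(x) = 1 + x$$
$$R^{(2)}_{1,1}(x) = \frac{1 + \tfrac12 x}{1 - \tfrac12 x} \qquad R^{(3)}_{2,1}(x) = \frac{1 + \tfrac23 x + \tfrac16 x^2}{1 - \tfrac13 x}$$

and rewriting $R_{2,2}(x)$ so that $b_0 = 1$, we have

$$R_{2,2}(x) = \frac{1 + \tfrac12 x + \tfrac1{12} x^2}{1 - \tfrac12 x + \tfrac1{12} x^2}$$

Then using (7.3-6), we calculate

$$d^{(0,0)}_1 = 1 \qquad d^{(1,0)}_2 = \tfrac12 \qquad d^{(1,1)}_3 = -\tfrac1{12} \qquad d^{(2,1)}_4 = -\tfrac1{72} \qquad d^{(2,2)}_5 = \tfrac1{720}$$

Further $T_5(x) = 16x^5 - 20x^3 + 5x$, so from (7.7-11) with $N = 4$ and $\alpha = 1$

$$\gamma_0 = \gamma_2 = \gamma_4 = 0 \qquad \gamma_1 = \tfrac1{2304} \qquad \gamma_3 = \tfrac1{48}$$

Then from (7.7-10)

$$C_{2,2}(x) = \frac{(1 + \tfrac12 x + \tfrac1{12} x^2) + \tfrac1{48}(1 + \tfrac12 x) + \tfrac1{2304}}{(1 - \tfrac12 x + \tfrac1{12} x^2) + \tfrac1{48}(1 - \tfrac12 x) + \tfrac1{2304}} = \frac{2353 + 1176x + 192x^2}{2353 - 1176x + 192x^2} = \frac{1.0 + .49978751x + .08159796x^2}{1.0 - .49978751x + .08159796x^2}$$

In Fig. 7.4 and Table 7.3 the errors in this approximation are compared with $R_{2,2}(x)$
and $T_{2,2}(x)$. As with the economized power series, the economized rational function is not
quite so good as that derived using a Chebyshev expansion, but it is much superior to the
Padé approximation.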
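The bookkeeping of Example 7.5 can also be checked in exact arithmetic. In the sketch below the Padé approximations and the constants $d$ are entered by hand from the example, only the terms with nonzero $t_{j+1}$ are formed, and the sign convention for $\gamma_{j+1}$ is the one that reproduces the integer coefficients quoted in the example; the function name and data layout are our own.

```python
from fractions import Fraction as F

def economized_exp_22():
    """Numerator and denominator of C_{2,2}(x) for e^x, scaled to integers."""
    # Polynomials are coefficient lists [x^0, x^1, x^2].
    P = [F(1), F(1, 2), F(1, 12)]      # numerator of R_{2,2}
    Q = [F(1), F(-1, 2), F(1, 12)]     # denominator of R_{2,2}
    # j -> (P^(j), Q^(j), d_{j+1}) for the members with t_{j+1} != 0.
    seq = {
        0: ([F(1), F(0), F(0)], [F(1), F(0), F(0)], F(1)),               # R_{0,0}
        2: ([F(1), F(1, 2), F(0)], [F(1), F(-1, 2), F(0)], F(-1, 12)),   # R_{1,1}
    }
    d5, N, alpha = F(1, 720), 4, 1
    t = [0, 5, 0, -20, 0, 16]          # T_5(u) = 16u^5 - 20u^3 + 5u
    num, den = P[:], Q[:]
    for j, (Pj, Qj, dj) in seq.items():
        gamma = d5 / dj * F(alpha) ** (N - j) / 2**N * t[j + 1]
        num = [n + gamma * p for n, p in zip(num, Pj)]
        den = [q + gamma * qj for q, qj in zip(den, Qj)]
    # gamma_0 vanishes here because t_0 = 0; clear fractions by 2304.
    return [c * 2304 for c in num], [c * 2304 for c in den]

print(economized_exp_22())
```

The two lists returned are the coefficients $2353, \pm 1176, 192$ of the example.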


This completes our discussion of the economization of rational func- 
tions. Although we have come close, we have not yet succeeded in deriving 
true minimax approximations. In the next section, we shall state the theorem 
which gives the characterization necessary for the true minimax approxima- 
tion to a function over an interval. 


7.8 CHEBYSHEV’S THEOREM ON MINIMAX 
APPROXIMATIONS 


Let $f(x)$ be the continuous function we wish to approximate over the finite
interval $[a, b]$ in the form (7.2-1). Let

$$r_{m,k} = \max_{a \le x \le b} |f(x) - R_{m,k}(x)| \tag{7.8-1}$$

for any rational function

$$R_{m,k}(x) = \frac{P_m(x)}{Q_k(x)} = \frac{\sum_{j=0}^{m} a_j x^j}{\sum_{j=0}^{k} b_j x^j} \tag{7.8-2}$$

Then we can prove† the following theorem.


Theorem 7.2 (Chebyshev) There exists a unique rational function $R^*_{m,k}(x)$
which minimizes $r_{m,k}$ (two rational functions are considered identical if
they are equal when reduced to their lowest terms). Moreover, if we
write this unique rational function as

$$R^*_{m,k}(x) = \frac{\sum_{j=0}^{m-\nu} a_{j+\nu} x^j}{\sum_{j=0}^{k-\mu} b_{j+\mu} x^j} = \frac{P^*_m(x)}{Q^*_k(x)} \tag{7.8-3}$$

where

$$0 \le \mu \le k \qquad 0 \le \nu \le m \qquad a_m, b_k \ne 0 \tag{7.8-4}$$

and $P^*_m(x)/Q^*_k(x)$ is irreducible, then if $r^*_{m,k} \ne 0$, the number of
consecutive points of $[a, b]$ at which $f(x) - R^*_{m,k}(x)$ takes on its maximum value
of magnitude $r^*_{m,k}$ with alternate change of sign is not less than
$L = m + k + 2 - d$, where $d = \min(\mu, \nu)$.


† Chebyshev proved only the characterization and uniqueness parts of the theorem. The
existence portion was proved much later by Walsh.


In particular, the theorem says that when $k = 0$, the number of points at
which the error attains its maximum magnitude is at least $m + 2$. Note how
nearly $T_{4,0}(x)$ as given by (7.6-5) meets this requirement.

We shall leave the existence and uniqueness parts of the proof to the
problems {37, 38} and here prove only that $R^*_{m,k}(x)$ has the characteristic
form stated in the theorem. We note here that both this theorem and
Theorem 7.3 below are true if we consider approximations to $f(x)$ of the
form $s(x)R_{m,k}(x)$, where $s(x)$ is a given continuous function which does not
vanish in $(a, b)$ {39}.


PROOF We suppose that $L'$, the number of points at which $f(x) - R^*_{m,k}(x)$
takes on its maximum value $r^*_{m,k}$ with alternating sign, is such that
$L' < L$, so that

$$L' \le m + k + 1 - d \tag{7.8-5}$$

Note, for example, that if $f(x) - R^*_{m,k}(x)$ has the form of Fig. 7.5, the
number of points at which it takes on its extreme value with alternating
sign is only 3 since the extrema labeled 2 and 3 have the same sign.
We can then subdivide the interval $[a, b]$ into $L'$ subintervals†

$$[a, x_1], [x_1, x_2], \ldots, [x_{L'-1}, b] \tag{7.8-6}$$

such that in alternate intervals the inequalities

$$-r^*_{m,k} \le f(x) - R^*_{m,k}(x) < r^*_{m,k} - \alpha \tag{7.8-7}$$

and

$$-r^*_{m,k} + \alpha < f(x) - R^*_{m,k}(x) \le r^*_{m,k} \tag{7.8-8}$$

are satisfied for some positive number $\alpha$. In Fig. 7.5 the points $a$, $x_1$, $x_2$,
$b$ divide $[a, b]$ into subintervals in which (7.8-8) and (7.8-7) are
alternately satisfied.

Figure 7.5 An example of alternating extrema of $f(x) - R^*_{m,k}(x)$.

We consider the function

$$A(x) = (x - x_1)(x - x_2) \cdots (x - x_{L'-1}) \tag{7.8-9}$$

Since we have assumed that $P^*_m(x)/Q^*_k(x)$ is irreducible, we can write {36}

$$A(x) = Q^*_k(x) a(x) - P^*_m(x) b(x) \tag{7.8-10}$$

where $a(x)$ and $b(x)$ are, respectively, polynomials of degree less than or
equal to $m$ and $k$.

We shall now display a function $R_{m,k}(x)$ which achieves a smaller
maximum error than $R^*_{m,k}(x)$ on $[a, b]$ under the assumption
that $L' < L$. This contradiction will prove the theorem. Let

$$R_{m,k}(x) = \frac{P^*_m(x) - \beta a(x)}{Q^*_k(x) - \beta b(x)} \tag{7.8-11}$$

where $\beta$ is a real number. The numerator is then a polynomial of
degree $\le m$, and the denominator is a polynomial of degree $\le k$. Now
we can write

$$f(x) - R_{m,k}(x) = f(x) - R^*_{m,k}(x) + \frac{\beta A(x)}{Q^*_k(x)[Q^*_k(x) - \beta b(x)]} \tag{7.8-12}$$

by making use of (7.8-10), (7.8-11), and (7.8-3). Since $Q^*_k(x)$ must be
bounded away from zero in $[a, b]$ [for, if not, $R^*_{m,k}(x)$ will certainly not be
a minimax approximation], by choosing $\beta$ sufficiently small in
magnitude, the denominator of the last term on the right-hand side of (7.8-12)
will also be bounded away from zero in $[a, b]$. The numerator changes
sign at each point $x_i$ because of (7.8-9) and is of constant sign in each
subinterval (7.8-6). Thus we can choose $\beta$ so that

$$\left| \frac{\beta A(x)}{Q^*_k(x)[Q^*_k(x) - \beta b(x)]} \right| < \alpha \qquad x \in [a, b] \tag{7.8-13}$$

and such that this term has sign opposite to the maximum error in
$f(x) - R^*_{m,k}(x)$ in each subinterval (7.8-6). With $\beta$ chosen in this way
(7.8-7), (7.8-8), and (7.8-13) imply that $R_{m,k}(x)$ in (7.8-11) has a smaller
maximum error than $R^*_{m,k}(x)$ on $[a, b]$, which is the desired
contradiction. Therefore, we have proved that $L' \ge m + k + 2 - d$.


† That such a subdivision exists is not quite trivial; see Davis (1963, pp. 149-151).


Theorem 7.2 gives us no way of constructing minimax approximations.
Nor does it enable us to judge how close to the minimax approximation a
given approximation is. The plot of $e^x - R^*_{4,0}(x)$ in Fig. 7.3 implies that
$T_{4,0}(x)$ is very nearly a minimax approximation. Conversely the plot of
$e^x - R^*_{2,2}(x)$ in Fig. 7.4 indicates that $T_{2,2}(x)$ deviates a good deal from the
minimax approximation. We might surmise then that if $L$ alternating
extrema of the error in a given approximation are of nearly equal magnitude,
this approximation achieves nearly as small a maximum error as the
minimax approximation. The following theorem confirms this.


Theorem 7.3 Let

$$R_{m,k}(x) = \frac{P_m(x)}{Q_k(x)} = \frac{\sum_{j=0}^{m-\nu} a_{j+\nu} x^j}{\sum_{j=0}^{k-\mu} b_{j+\mu} x^j} \tag{7.8-14}$$

be irreducible and let the difference $f(x) - R_{m,k}(x)$ remain finite in $[a, b]$.
Further, let $x_1 < x_2 < \cdots < x_L$ be in $[a, b]$ and let

$$f(x_i) - R_{m,k}(x_i) = (-1)^i \lambda_i \qquad i = 1, \ldots, L \tag{7.8-15}$$

where $\lambda_i > 0$ for all $i$ and $L = m + k + 2 - d$ with $d = \min(\mu, \nu)$. Let
$S_{m,k}(x) = p_m(x)/q_k(x)$ be any rational approximation to $f(x)$ with degrees
of numerator and denominator less than or equal to, respectively, $m$ and
$k$ and let

$$s_{m,k} = \max_{[a,b]} |f(x) - S_{m,k}(x)| \tag{7.8-16}$$

Then

$$s_{m,k} \ge \min_i \lambda_i \tag{7.8-17}$$


We leave the proof of this theorem to a problem {38}.

By identifying $S_{m,k}(x)$ with $R^*_{m,k}(x)$, this theorem says that the error in the
minimax approximation is no smaller than the magnitude of the smallest
extremum in the error of $R_{m,k}(x)$. Since the minimax error is certainly no larger
than the magnitude of the largest extremum in the error of $R_{m,k}(x)$, we
thereby obtain an upper and lower bound on the minimax error from any
approximation $R_{m,k}(x)$ which satisfies the conditions of Theorem 7.3. For
example, from Fig. 7.3 the six extrema of $T_{4,0}(x)$ vary in magnitude between
$5.03 \times 10^{-4}$ and $5.92 \times 10^{-4}$ with the minimax error given by $5.47 \times 10^{-4}$.
In Fig. 7.4 the extrema of $T_{2,2}(x)$ vary in magnitude between $4.20 \times 10^{-5}$
and $18.88 \times 10^{-5}$, and the minimax error is equal to $8.69 \times 10^{-5}$.
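The bracketing of the minimax error described here is easy to carry out for $T_{4,0}(x)$. The sketch below locates the alternating extrema of $e^x - T_{4,0}(x)$ on a uniform grid (the grid size is an arbitrary choice of ours) using the rounded coefficients of (7.6-4), and returns the smallest and largest magnitudes as the lower and upper bounds of Theorem 7.3. Because it starts from (7.6-4) rather than from the power form (7.6-5), the extrema may differ from those quoted above in the last digit.

```python
import math

T40_COEFFS = [1.266066, 1.130318, 0.271495, 0.044337, 0.005474]  # from (7.6-4)

def t40(x):
    """T_{4,0}(x) of (7.6-4), evaluated with the recurrence (7.5-3)."""
    t_prev, t, total = 1.0, x, T40_COEFFS[0]
    for c in T40_COEFFS[1:]:
        total += c * t
        t_prev, t = t, 2.0 * x * t - t_prev
    return total

def error_extrema(approx, n=4000):
    """Endpoint values and interior local extrema of e^x - approx(x) on [-1, 1]."""
    xs = [-1.0 + 2.0 * i / n for i in range(n + 1)]
    es = [math.exp(x) - approx(x) for x in xs]
    ext = [es[0]]
    ext += [es[i] for i in range(1, n)
            if (es[i] - es[i - 1]) * (es[i + 1] - es[i]) < 0]
    ext.append(es[-1])
    return ext

def minimax_bounds(approx):
    """(lower, upper) bounds on the minimax error from Theorem 7.3."""
    mags = [abs(e) for e in error_extrema(approx)]
    return min(mags), max(mags)

print([round(e * 1e4, 2) for e in error_extrema(t40)])
print(minimax_bounds(t40))
```

The six extrema alternate in sign, so the conditions of Theorem 7.3 hold with $L = m + k + 2 = 6$, and the two bounds enclose the minimax error $5.47 \times 10^{-4}$ quoted in the text.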

It remains now only to consider some means for constructing the mini- 
max approximation. This is the subject of the last section of this chapter. 
7.9 CONSTRUCTING MINIMAX APPROXIMATIONS


A number of algorithms are known by which the true Chebyshev approxi- 
mation to a function for a given m and k can be calculated. All of them 
proceed by generating a sequence of approximations which in the limit 
converges to the Chebyshev approximation. In this section, we shall present 
the two algorithms which are the most commonly used to generate mini- 
max approximations on computers, namely the second algorithm of Remes 
and the differential correction algorithm. 


7.9-1 The Second Algorithm of Remes 


For simplicity, we shall assume in this section that the minimax
approximation to $f(x)$ on $[a, b]$ of the form (7.8-3) has at least

$$N + 2 = m + k + 2$$

points at which the extreme value of the error is attained. That is, we assume
that $d$ in Theorem 7.2 is 0.† We also assume, without loss of generality, that
the interval $[a, b]$ includes $x = 0$ so that we can set $b_0 = 1$. The problem then
is:


Given: A continuous function $f(x)$ on an interval $[a, b]$ including zero and $m$
and $k$

Find: $R^*_{m,k}(x)$, the minimax approximation to $f(x)$ on $[a, b]$ of the form
(7.8-3).


Let us suppose that we have somehow obtained (see comments after
the algorithm) an approximation

$$R^{(0)}_{m,k}(x) = \frac{\sum_{j=0}^{m} a_j x^j}{\sum_{j=0}^{k} b_j x^j} \qquad b_0 = 1 \tag{7.9-1}$$

such that $f(x) - R^{(0)}_{m,k}(x)$ has $N + 2$ extrema which alternate in sign. Then the
second algorithm of Remes is as follows:


1. Let x0) < x) <+++ <x), be N +2 points at which f(x) — R“&)(x) has 
local extrema which “alternate i in sign. 


+ Degeneracy, that is, d # 0, can occur when f(x) is even or odd on [a, b] {40}. Except in 
trivial cases like this, it occurs in only some quite pathological cases. But near degeneracy, in 
which P*(x) and Q#*(x) have a nearly common factor, is more common and can cause severe 
computational difficulties. 
2. Solve the system of N + 2 nonlinear equations 

$$f(x_i^{(0)}) - \frac{\sum_{j=0}^{m} a_j (x_i^{(0)})^j}{\sum_{j=0}^{k} b_j (x_i^{(0)})^j} = (-1)^i E \qquad i = 0, \ldots, N+1 \qquad b_0 = 1 \tag{7.9-2}$$

for the N + 2 unknowns $a_0, \ldots, a_m, b_1, \ldots, b_k$, and E. Call the solution 
$a_0^{(0)}, \ldots, a_m^{(0)}, b_1^{(0)}, \ldots, b_k^{(0)}, E^{(0)}$. Note that $|E^{(0)}|$ is the magnitude of the 
error in the approximation at each of the points $x_i^{(0)}$. 

3. Define 

$$h_0(x) = f(x) - \frac{\sum_{j=0}^{m} a_j^{(0)} x^j}{\sum_{j=0}^{k} b_j^{(0)} x^j} \qquad b_0^{(0)} = 1 \tag{7.9-3}$$


The function $h_0(x)$ then has magnitude $|E^{(0)}|$ with alternating sign at the 
points $x_i^{(0)}$, i = 0, ..., N + 1. Therefore, it is not hard to show that in the 
neighborhood of each $x_i^{(0)}$ there is a point $x_i^{(1)}$ at which $h_0(x)$ has an 
extremum of the same sign as that of $f(x) - R_{mk}^{(0)}(x)$ at $x_i^{(0)}$. Replace each 
$x_i^{(0)}$ by the corresponding $x_i^{(1)}$. If $\bar{x}$, the point at which $h_0(x)$ has its 
maximum magnitude, is one of the points $x_i^{(1)}$, proceed to step 4. If not, 
replace one of the points $x_i^{(1)}$ by $\bar{x}$ in such a way that $h_0(x)$ still alternates 
in sign on the points $x_i^{(1)}$. This can always be done (why?). 

4. Repeat steps 2 and 3 using the points $x_0^{(1)}, \ldots, x_{N+1}^{(1)}$ in (7.9-2). This process 
then generates a sequence of rational approximations of the form (7.9-1) 
which converges uniformly to $R^*_{mk}(x)$ if the initial extrema $x_i^{(0)}$, i = 0, ..., 
N + 1, are sufficiently close (see below) to the corresponding extrema of 
$R^*_{mk}(x)$. 


We make the following comments on this algorithm: 


1. If k = 0, the iteration will converge for an arbitrary choice of the N + 2 
abscissas in Step 1. That is, an initial approximation of the form (7.9-1) is 
not necessary [Novodvorskii and Pinsker (1951)]. However, when k ≠ 0, 
all that can be said is that there exists an ε > 0 such that if each extremum 
of $f(x) - R_{mk}^{(0)}(x)$ lies within ε of the corresponding extremum of 
$f(x) - R^*_{mk}(x)$, the algorithm will converge. Thus what is really required is not an 
approximation $R_{mk}^{(0)}(x)$ but a set of extrema lying sufficiently close to the 
corresponding extrema of $f(x) - R^*_{mk}(x)$. For example we might use the 
N + 2 extrema of the Chebyshev polynomial $T_{N+1}(x)$, suitably related to 
[a, b]. But in many cases, in order to obtain a set of N + 2 points $x_i^{(0)}$ for which 
the algorithm will converge, it is necessary first to derive an approximation 
$R_{mk}^{(0)}(x)$. When $f(x) - C_{mk}(x)$ has N + 2 extrema, $C_{mk}(x)$ can almost 
always be used for $R_{mk}^{(0)}(x)$. In some particularly difficult cases a technique 
due to Werner (1962) can be used to generate an appropriate $R_{mk}^{(0)}(x)$. 

2. When k = 0, the equations of step 2 of the algorithm are, in fact, linear 
(why?) and thus are easily solvable (see, for example, Sec. 9.3). When 
k ≠ 0, the nonlinear system can be solved as follows: 

a. Write the system (7.9-2) as 

$$\sum_{j=0}^{m} a_j (x_i^{(0)})^j - \left[f(x_i^{(0)}) - (-1)^i E_r\right] \sum_{j=1}^{k} b_j (x_i^{(0)})^j = f(x_i^{(0)}) - (-1)^i E_{r+1} \qquad r = 0, 1, \ldots \qquad b_0 = 1 \tag{7.9-4}$$

b. Starting from an assumed initial value of $E_0$, solve the linear system 
(7.9-4) in the unknowns $a_0, \ldots, a_m, b_1, \ldots, b_k$ and $E_{r+1}$ for r = 0, 1, ..., 
until two successive values of $E_r$ are in agreement. In the absence of 
other information (cf. Example 7.6 below), $E_0 = 0$ can be used. In 
practice, convergence of this method has seldom proved to be a 
problem. 
3. The problem of finding the extrema of $h_0(x)$ is computationally intractable 
if we wish the exact solution. However, since the extrema of $h_0(x)$ will 
be close to those of $f(x) - R_{mk}^{(0)}(x)$ (and similarly at subsequent stages of 
the algorithm), it is sufficient in practice to search in the neighborhood of 
$x_i^{(0)}$ with a mesh of points until the approximate location of the extremum 
is found. Then $x_i^{(1)}$ is chosen, using a single stage of linear or quadratic 
inverse interpolation (cf. Sec. 3.6) as the point where the derivative of the 
error is zero. 
4. This algorithm is also applicable to the case in which an approximation 
of the form $s(x)R_{mk}(x)$ is desired, where s(x) is a given continuous function 
which does not vanish in (a, b) {39}. Choosing s(x) = 1/f(x) and 
choosing the function to be approximated to be identically 1 enables us to 
compute the best relative minimax approximation to f(x) on any interval 
on which f(x) does not vanish since 

$$1 - \frac{R_{mk}(x)}{f(x)} = \frac{1}{f(x)}\left[f(x) - R_{mk}(x)\right]$$
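
For the polynomial case k = 0, where the equations of step 2 are linear and any starting abscissas will do, steps 1 to 4 can be sketched as follows. This is only an illustration under stated assumptions: the names `solve_linear` and `remes_polynomial`, the mesh size, and the stopping rule (stop when the mesh points repeat) are choices made here, not part of the text, and the extrema are located on a fixed mesh without the inverse-interpolation refinement mentioned in comment 3.

```python
import math


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def remes_polynomial(f, a, b, m, mesh=2001, max_sweeps=30):
    """Second algorithm of Remes for k = 0 (polynomial of degree m)."""
    # Step 1: with k = 0 any N + 2 = m + 2 abscissas will do; the extrema of
    # T_{m+1}, mapped to [a, b], are a convenient (assumed) choice.
    xs = [0.5 * (a + b) - 0.5 * (b - a) * math.cos(math.pi * i / (m + 1))
          for i in range(m + 2)]
    grid = [a + (b - a) * t / (mesh - 1) for t in range(mesh)]
    for _ in range(max_sweeps):
        # Step 2: for k = 0 the system (7.9-2) is linear in a_0..a_m and E.
        rows = [[x ** j for j in range(m + 1)] + [(-1.0) ** i]
                for i, x in enumerate(xs)]
        sol = solve_linear(rows, [f(x) for x in xs])
        coeffs, E = sol[:-1], sol[-1]
        # Step 3: tabulate the error on the mesh and keep, in every run of
        # constant sign, the mesh point where its magnitude is largest.
        h = [f(x) - sum(c * x ** j for j, c in enumerate(coeffs)) for x in grid]
        new_xs, best = [], 0
        for t in range(1, mesh):
            if h[t] * h[best] < 0:          # sign change: close the current run
                new_xs.append(grid[best])
                best = t
            elif abs(h[t]) > abs(h[best]):
                best = t
        new_xs.append(grid[best])
        if len(new_xs) != m + 2:
            raise ValueError("error curve does not have exactly m + 2 sign runs")
        if new_xs == xs:                    # Step 4: stop once the points repeat
            break
        xs = new_xs
    return coeffs, E, xs


coeffs, E, xs = remes_polynomial(math.exp, -1.0, 1.0, 1)
```

For a straight-line approximation to the exponential on [−1, 1] the result can be checked by hand: the slope must equal sinh 1 and the leveled error is about 0.2788. The sketch simply raises an error when the error curve has more or fewer than m + 2 sign runs, a case a production routine would have to handle by the exchange described in step 3.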


Example 7.6 Let $f(x) = e^x$. Starting with $C_{2,2}(x)$ as found in Example 7.5, use the above 
algorithm to find the minimax approximation to $e^x$ on [−1, 1] with m = k = 2. 

Arbitrarily, we divide the interval [−1, 1] into 201 points, equally spaced at an 
interval of .01. The six of these points at which $e^x - C_{2,2}(x)$ has its extreme values are (see 
Fig. 7.4) 


−1.00, −.80, −.27, +.35, +.82, +1.00 


Of these extrema, the one at +1.00 has the greatest magnitude, $2.39 \times 10^{-4}$. We expect 
the maximum error in the minimax approximation to be somewhat less than $2.39 \times 10^{-4}$. 
(In fact, we know from the previous section that $r^*_{2,2} < 1.89 \times 10^{-4}$.) Therefore, as $E_0$ in 
(7.9-4), we choose $-2.0 \times 10^{-4}$, the minus sign being used because the error at −1.00, 
that is, i = 0 in (7.9-2), is negative. Then solving (7.9-2) by the algorithm given above, we 
get 


$a_0 = 1.00007407$    $a_1 = .50802883$    $a_2 = .08549199$ 
$b_1 = -.49166131$    $b_2 = .07792980$    $E = -.745 \times 10^{-4}$ 


The maximum error on [−1, 1] is $.978 \times 10^{-4}$ at x = .47333. The six extrema of this 
approximation are at 


−1.00, −.73773, −.13475, .47333, .86488, 1.00 


The next stage of the algorithm gives 


$a_0 = 1.00007275$    $a_1 = .50864603$    $a_2 = .08583370$ 
$b_1 = -.49108231$    $b_2 = .07770411$    $E = -.867 \times 10^{-4}$ 


Now the maximum error is $.870 \times 10^{-4}$ at x = −.11898. Note that this maximum error is 
very little greater than E, so that the process has almost converged. The extrema are now 
at 


−1.00, −.72601, −.11898, .47363, .86571, 1.00 

A third application of the algorithm gives 

$a_0 = 1.00007255$    $a_1 = .50863618$    $a_2 = .08582937$ 
$b_1 = -.49109193$    $b_2 = .07770847$    $E = -.8689990 \times 10^{-4}$ 


The maximum error is now $.8689996 \times 10^{-4}$ at x = .47357. Thus, for all practical purposes, 
these coefficients give the Chebyshev approximation to $e^x$ on [−1, 1]. 
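
The first stage of this example, that is, solving (7.9-2) on the six mesh extrema by iterating the linear system (7.9-4) from $E_0 = -2.0 \times 10^{-4}$, can be sketched as below. The function names (`solve_linear`, `level_error`, `rational`), the tolerance and the iteration cap are assumptions made for illustration; only the points, m = k = 2, and the starting value of E come from the example.

```python
import math


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def level_error(f, xs, m, k, E0=0.0, tol=1e-12, max_iter=200):
    """Solve (7.9-2) on the points xs by iterating the linear system (7.9-4)."""
    fx = [f(x) for x in xs]
    E = E0
    for _ in range(max_iter):
        rows = []
        for i, (x, fi) in enumerate(zip(xs, fx)):
            sign = (-1.0) ** i
            row = [x ** j for j in range(m + 1)]                          # a_0..a_m
            row += [-(fi - sign * E) * x ** j for j in range(1, k + 1)]   # b_1..b_k
            row.append(sign)                                              # E_{r+1}
            rows.append(row)
        sol = solve_linear(rows, fx)
        a, b, E_next = sol[:m + 1], [1.0] + sol[m + 1:-1], sol[-1]
        if abs(E_next - E) < tol:      # two successive values of E_r agree
            return a, b, E_next
        E = E_next
    raise RuntimeError("E_r did not settle within max_iter iterations")


def rational(a, b, x):
    """Evaluate the rational function with numerator a and denominator b."""
    p = sum(c * x ** j for j, c in enumerate(a))
    q = sum(c * x ** j for j, c in enumerate(b))
    return p / q


# First stage of Example 7.6: m = k = 2, six mesh extrema, E_0 = -2.0e-4.
points = [-1.00, -0.80, -0.27, 0.35, 0.82, 1.00]
a, b, E = level_error(math.exp, points, 2, 2, E0=-2.0e-4)
for i, x in enumerate(points):
    print(x, math.exp(x) - rational(a, b, x), (-1.0) ** i * E)
```

The printed errors at the six points should alternate in sign with common magnitude |E|; the coefficients and E can then be compared with the first set of values quoted in the example.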


7.9-2 The Differential Correction Algorithm 


In this algorithm we assume that the function f(x) is defined not on an 
interval but on a discrete point set $X = \{x_1, x_2, \ldots, x_l\}$, and we attempt to 
find $R^*_{mk}(x)$ which is such that 

$$\max_{1 \le i \le l} |f(x_i) - R^*_{mk}(x_i)| \le \max_{1 \le i \le l} |f(x_i) - R_{mk}(x_i)| \tag{7.9-5}$$

for all $R_{mk}(x)$. We note first that this problem is not significantly different 
from that of finding the best approximation on an interval [a, b] because, as 
noted in the previous section, the implementation of the second algorithm of 
Remes results in searching for the extrema of h(x) on a mesh of points in 
[a, b]. This mesh is essentially equivalent to the point set X. Or to put it 
another way, we expect that the best approximation on a rather dense point 
set in [a, b] will normally be very close to the best approximation on the 
interval. 
To describe the differential correction algorithm we first define 

$$r_{mk}^{(s)} = \max_{1 \le i \le l} \left| f(x_i) - \frac{P_m^{(s)}(x_i)}{Q_k^{(s)}(x_i)} \right| \tag{7.9-6}$$

where the $R_{mk}^{(s)}(x) = P_m^{(s)}(x)/Q_k^{(s)}(x)$ are a sequence of approximations to 
$R^*_{mk}(x)$. Starting from an initial approximation of the form (7.9-1), this algorithm 
considers at each step 

$$w = \max_{1 \le i \le l} \frac{|f(x_i)Q_k(x_i) - P_m(x_i)| - r_{mk}^{(s)} Q_k(x_i)}{Q_k^{(s)}(x_i)} = \max_{1 \le i \le l} \frac{\left[\,|\epsilon_{mk}(x_i)| - r_{mk}^{(s)}\right] Q_k(x_i)}{Q_k^{(s)}(x_i)} \tag{7.9-7}$$

where 

$$\epsilon_{mk}(x) = f(x) - \frac{P_m(x)}{Q_k(x)} \tag{7.9-8}$$

At each step we seek $R_{mk}^{(s+1)}(x)$ which minimizes w in (7.9-7) subject to the 
additional condition that $Q_k(x_i) > 0$ for all i. The second form of w in (7.9-7) 
indicates that what we seek are coefficients $a_j$ and $b_j$ such that 

$$r_{mk}^{(s+1)} = \max_{i} |\epsilon_{mk}(x_i)|$$

is as much less than $r_{mk}^{(s)}$ as possible, or, putting it another way, we wish 
(7.9-7) to be as negative as possible. That a negative value can be found at 
each step is certain, for otherwise $r_{mk}^{(s)}$ would equal $r^*_{mk}$. 
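
The agreement of the two forms of w in (7.9-7) rests only on the positivity constraint on the denominator. The short derivation below spells this out; it is a sketch in the notation of this section, assuming $Q_k(x_i) > 0$ and $Q_k^{(s)}(x_i) > 0$.

```latex
\begin{aligned}
|f(x_i)Q_k(x_i) - P_m(x_i)| - r_{mk}^{(s)} Q_k(x_i)
  &= Q_k(x_i)\left| f(x_i) - \frac{P_m(x_i)}{Q_k(x_i)} \right| - r_{mk}^{(s)} Q_k(x_i) \\
  &= \left[\, |\epsilon_{mk}(x_i)| - r_{mk}^{(s)} \right] Q_k(x_i)
\end{aligned}
```

Dividing by $Q_k^{(s)}(x_i) > 0$ and taking the maximum over i gives the second form. Two consequences follow at once: choosing $P_m = P_m^{(s)}$ and $Q_k = Q_k^{(s)}$ gives w = 0, so the minimum of w is never positive, and any choice with w < 0 has $|\epsilon_{mk}(x_i)| < r_{mk}^{(s)}$ at every point of X.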

To perform the computation which minimizes w we first note that 
(7.9-7) can be written 

$$w\,Q_k^{(s)}(x_i) \ge |f(x_i)Q_k(x_i) - P_m(x_i)| - r_{mk}^{(s)} Q_k(x_i) \qquad i = 1, \ldots, l \tag{7.9-9}$$

which is equivalent to the two inequalities 

$$\begin{aligned}
w\,Q_k^{(s)}(x_i) + r_{mk}^{(s)} Q_k(x_i) - [f(x_i)Q_k(x_i) - P_m(x_i)] &\ge 0 \\
w\,Q_k^{(s)}(x_i) + r_{mk}^{(s)} Q_k(x_i) + [f(x_i)Q_k(x_i) - P_m(x_i)] &\ge 0
\end{aligned} \tag{7.9-10}$$

Our object is to minimize w subject to the constraints (7.9-10), the constraints 
$Q_k(x_i) > 0$, and the normalizing condition $b_0 = 1$ (or alternately, as 
is often done with this algorithm, $\max_j |b_j| = 1$). This is a linear programming 
problem and can be solved by any method for the solution of such 
problems such as the simplex method (see Sec. 9.10). 

The major advantage of the differential correction algorithm over the 
second algorithm of Remes is that it converges for any starting approximation 
$R_{mk}^{(0)}(x)$ as long as $Q_k^{(0)}(x_i) > 0$ for all i. However, it should be noted that 
when the Remes algorithm does not converge, the number of points l will 
probably have to be large in order for the differential correction algorithm 
to lead to a solution close to that for the continuous interval. This results 
in a large linear programming problem and therefore a great deal of computation. 
Another occasionally useful property of the differential correction 
algorithm is that it quite easily accommodates constraints on the approximation 
of the form $R_{mk}(x_i) = f(x_i)$ at selected points $x_i$. Both algorithms have 
their devotees, and both are in quite common use for calculating best 
rational approximations. 


BIBLIOGRAPHIC NOTES 


Section 7.1 The need to approximate functions on digital computers has given a great 
impetus to the well-established mathematical field of approximation theory. The books by 
Achieser (1956), Cheney (1966), Davis (1963), Fike (1968), and Rice (1964, 1969) all contain 
valuable material. Some of the earliest work on approximations for digital computers was done 
by Hastings (1955). The most up-to-date compilation of approximations for digital computers 
is contained in the handbook by Hart et al. (1968). Recent surveys of the field have been given 
by Kogbetliantz (1960), Stiefel (1959), Cheney and Southard (1963), and Cody (1970). 


Section 7.2 The algorithm for evaluating polynomials is discussed in Knuth (1969). The 
standard text on continued fractions is that of Wall (1948). For techniques of converting 
rational functions to continued fractions, see Maehly (1960a). 


Sections 7.3 and 7.4 The Padé table is extensively discussed by Wall (1948). Much of the 
material of these sections can be found in Kogbetliantz (1960). Lawson (1964) considers the 
problem of using different approximations on different subintervals. 


Section 7.5 The classic reference on the Chebyshev polynomials and their many applica- 
tions is Lanczos (1956). For more recent treatments see Snyder (1966) and Fox and Parker 
(1968). 


Section 7.6 Much of the material in this section is due to Maehly and is discussed by 
Kogbetliantz (1960). Minnick (1957) discusses the use of the Maclaurin series to evaluate the 
coefficients in the Chebyshev expansion, and Spielberg (1961) uses this technique to mechanize 
the procedure on a digital computer. Powell (1967) shows that, for polynomial approximations, 
the Chebyshev expansion results in a minimum maximum error not much larger than the best 
approximation. Luke (1969, 1975) contains the coefficients for the Chebyshev expansion for 
many mathematical functions. 


Section 7.7 The concept of economization of power series is due to Lanczos (1938) [see 
also Lanczos (1956)] and is discussed by Kogbetliantz (1960). The extension to rational func- 
tions was made by Maehly (1960b) and in the form presented here by Ralston (1963). 


Section 7.8 The proof of Chebyshev’s theorem here is from Achieser (1956). 


Section 7.9 For rational functions the proofs of the convergence of the second algorithm 
of Remes and the related exchange algorithm of Remes are given by Ralston (1965). Werner 
(1962) has given a proof of the convergence of a modified form of the second algorithm. A proof 
of the convergence of the exchange algorithm for a class of functions including polynomials but 
not rational functions has been given by Novodvorskii and Pinsker (1951). The differential 
correction algorithm is discussed in detail by Barrodale, Powell, and Roberts (1972). 
PROBLEMS 


Section 7.2 


1 (a) Verify Eq. (7.2-7). (b) Use (7.2-7) to derive (7.2-8). 
*2 With $q_n(z)$ as in (7.2-5) let 

$$g_n(z) = z^n + b_{n-2}z^{n-2} + \cdots + b_{n(\mathrm{mod}\ 2)}z^{n(\mathrm{mod}\ 2)}$$
$$h_n(z) = b_{n-1}z^{n-1} + b_{n-3}z^{n-3} + \cdots + b_{(n-1)(\mathrm{mod}\ 2)}z^{(n-1)(\mathrm{mod}\ 2)}$$

Assume $q_n(z)$ has n − 1 zeros with nonnegative real parts and that $h_n(z)$ is not identically zero. 

(a) Show by induction that if $q_n(0) = 0$, then $g_n(z)$ has at least n − 2 pure imaginary zeros 
and $h_n(z)$ has at least n − 3 pure imaginary zeros. 

(b) Derive the same result if $q_n(\pm iy) = 0$ for some y. 

(c) Now assume $q_n(z)$ has no roots with zero real part. By considering the path in the 
complex plane consisting of a semicircle in the left half plane of radius R for sufficiently large R 
and the diameter of this semicircle on the imaginary axis, show that the results of parts (a) and 
(b) also hold in this case. 

(d) Deduce from this that the squares of the zeros of $g_n(z)$ and $h_n(z)$ are all real and, finally, 
from this deduce that if $u_n(y)$ as defined by (7.2-10) is not identically 0, it must have all real zeros. 


3 (a) Show that if all zeros of $p_n(x)$ have the same real part, the quadratic factorization of 
$p_n(x)$ is easy. 

(b) Show that cases 1 and 3 on p. 289 always lead to nonzero $b_{n-1}$ when n is even unless 
all zeros of $p_n(x)$ have the same real part. 

(c) Show that if case 2 on p. 289 leads to $b_{n-1} = 0$ when n is even, then unless $u_n(y)$ is 
identically 0, δ can be found for which $b_{n-1} \ne 0$. 


4 (a) Show that the choice $\delta = (a_{n-1} - 1)/n$ leads to a polynomial $q_n(y)$ in which 
$b_{n-1} = 1$. 

(b) If, for this choice of δ, $u_n(y)$ has all real roots, show that the number of multiplications 
for the quadratic-factor algorithm is n/2 when n is even and (n + 1)/2 when n is odd. 

(c) Show that if $u_n(y)$ does not have all real roots, the quadratic-factor algorithm may fail 
at some point but that when this happens, at most two applications of synthetic division will 
enable the quadratic-factor algorithm process to be continued. 
5 (a) Apply the quadratic-factor algorithm as presented in Sec. 7.2 to the polynomial 

$$(x + 7)(x^2 + 6x + 10)(x^2 + 4x + 5)(x + 1)$$

(b) Apply the algorithm of Prob. 4 to this polynomial and the polynomial of Example 7.1. 

*6 (a) Given A, B, C, D, E show how the equations 

$$\begin{aligned}
2p + 1 &= A \\
p(p + 1) + 2q + a &= B \\
p(2q + a) + q + r + s &= C \\
p(r + s) + r + q(q + a) &= D \\
ar + q(r + s) &= E
\end{aligned}$$

can be solved for a, p, q, r, and s by finding a real root of a certain cubic equation in q. 

(b) Then defining b = q − ac, c = p − a, d = s − bc, e = r − bc, f = F − rs, show that 

$$P(x) = x^6 + Ax^5 + Bx^4 + Cx^3 + Dx^2 + Ex + F$$

can be evaluated with three multiplications by the algorithm 

y = x(x + a) 
w = (y + b)(x + c) 
P(x) = (w + y + d)(w + e) + f 

(c) Show how this algorithm can be used to evaluate $x^6 + 13x^5 + 49x^4 + 33x^3 - 61x^2 - 37x + 3$. 
[Ref.: Knuth (1962).] 
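
The scheme of Prob. 6b can be checked numerically before attempting the algebra. The sketch below goes in the easy direction only: it starts from arbitrary values of a, p, q, r, s, and F (chosen here purely for illustration), forms A through E from the five equations of part (a), and compares the three-multiplication evaluation with Horner's rule. It does not solve the cubic of part (a); the function names are assumptions made here.

```python
def adapted_coefficients(a, p, q, r, s, F):
    """Coefficients [1, A, B, C, D, E, F] implied by the five equations of part (a)."""
    A = 2 * p + 1
    B = p * (p + 1) + 2 * q + a
    C = p * (2 * q + a) + q + r + s
    D = p * (r + s) + r + q * (q + a)
    E = a * r + q * (r + s)
    return [1, A, B, C, D, E, F]


def evaluate_adapted(x, a, p, q, r, s, F):
    """Evaluate P(x) by the scheme of part (b)."""
    # These parameters depend only on the polynomial and would be computed once.
    c = p - a
    b = q - a * c
    d = s - b * c
    e = r - b * c
    f = F - r * s
    y = x * (x + a)                    # multiplication 1
    w = (y + b) * (x + c)              # multiplication 2
    return (w + y + d) * (w + e) + f   # multiplication 3


def horner(coeffs, x):
    """Reference evaluation, highest-degree coefficient first."""
    value = 0
    for coeff in coeffs:
        value = value * x + coeff
    return value


params = (1, 2, 3, 4, 5, 6)            # illustrative a, p, q, r, s, F
for x in range(-4, 5):
    print(x, evaluate_adapted(x, *params), horner(adapted_coefficients(*params), x))
```

With integer parameters the two columns agree exactly, which is a quick check that the five equations and the definitions of b, c, d, e, f have been copied correctly.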


*7 Consider the following algorithm for computing $x^n$ (when no lower powers are 
required): 

(a) Write n as a binary number, for example, 9 = 1001. 

(b) Cancel the high-order 1 (001). 

(c) Replace each 0 by S (square) and each 1 by SX (square, multiply by x) (001 → SSSX). 

(d) Starting from the left with x compute by squaring or multiplying by x as specified in 
the “code word” {$x^9 = [(x^2)^2]^2 x$}. 

Prove that this algorithm is valid for any n. Hint: Use induction. [Ref.: Knuth (1962).] 
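
Steps (a) to (d) translate almost line for line into code. The sketch below is meant only as a way to experiment with the code words before writing the proof; the function name `power_by_code_word` is an assumption made here.

```python
def power_by_code_word(x, n):
    """Compute x**n (n >= 1) by the S/SX code word built from the binary form of n."""
    bits = bin(n)[3:]                  # binary digits of n with the high-order 1 cancelled
    code = "".join("SX" if bit == "1" else "S" for bit in bits)
    result = x
    for op in code:                    # read the code word from the left
        if op == "S":
            result = result * result   # S: square
        else:
            result = result * x        # X: multiply by x
    return result, code


print(power_by_code_word(3, 9))        # the code word for n = 9 is SSSX
```

The number of multiplications is the length of the code word, which is one less than the number of binary digits of n plus the number of 1 bits after the leading one.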


*8 (a) Show that any rational function 


2 a,x! m>k 
Rul) = b £0 
sy b,x! k # 
j=0 
can be written 
k . 
m-k-1 2, Pyxt? 
Rin (X) = +( y 5x} 418 qo = 1 
ine Y gx 
j=0 


Find a recurrence relation for the p;’s and q,’s in terms of the a,’s and b;’s. 
(b) Except in certain degenerate cases, show that we can write 


k 
xk-d 
be C, 
Eo SD > eeo—> do =4, = 1 
2 gx? (x + Y ax} | ( ¥ ax} 
J=0 j=l j=l 
Find D, and C, in terms of p,, p,, and q,. When do the degenerate cases occur? 
0 1 Po> Pi q1 
(c) Derive recurrence relations for the p; and q; in terms of the p, and q,. Indicate how you 
would solve these relations. 
(d) Thus deduce the correctness of (7.2-16) in nondegenerate cases. 
9 (a) Use the previous problem to derive the relation analogous to (7.2-16) when 
m > k + 1. 
(b) Do the same for m < k by considering $1/R_{mk}(x)$. 
(c) In both cases, how many multiplications and divisions are required to evaluate $R_{mk}(x)$? 
(d) Use the recurrence relations of the previous problem to transform 
$(x^4 - 5x^3 + 12x^2 - 11x + 2)/(x^2 - 3x + 4)$ to the form of part (a). 


Section 7.3 


10 (a) Derive Eqs. (7.3-5) using (7.3-4). 
(b) For m = k = 1, show that the Padé approximation to cos x does not exist. 


11 Let the Padé approximation $R_{mk}(x)$ be written 

$$R_{mk}(x) = \frac{P_m(x)}{Q_k(x)} = \frac{P_m^{(m,k)}}{Q_k^{(m,k)}}$$

Assuming the existence of all relevant Padé approximations, show that the following recurrence 
relations exist: 

$$P_m^{(m,k)} = P_{m-1}^{(m-1,k)} - DxP_{m-1}^{(m-1,k-1)} \qquad Q_k^{(m,k)} = Q_k^{(m-1,k)} - DxQ_{k-1}^{(m-1,k-1)}$$

$$P_m^{(m,k)} = P_m^{(m,k-1)} - ExP_{m-1}^{(m-1,k-1)} \qquad Q_k^{(m,k)} = Q_{k-1}^{(m,k-1)} - ExQ_{k-1}^{(m-1,k-1)}$$


where D and E are constants which also depend upon k and m. Find expressions for D and E in 
terms of the Maclaurin coefficients and the constants in the Padé approximations. [Ref.: Wall 
(1948), p. 15 and chap. 20.] 


*12 (a) Show that there exists a Padé approximation $R_{mk}(x)$ to cos x which contains only 
even powers of x for any m and k. 

(b) Thus find $R_{6,6}(x) = R_{3,3}(z)$ ($z = x^2$). 

(c) In the continued-fraction form of R3. 3(z) given by (7.2-16), calculate Dy. From the 
value of D,, deduce that in calculating cos x using this approximation there will be a loss of 
significance of about two digits. 

(d) For R, ,(z), calculate d, as given by (7.3-6). 


*13 In the notation of Prob. 12, consider adding a term —&b,z* to the numerator of 
R, 3(z), where b, is the coefficient of z* in the denominator. 
(a) Calculate b,,b,,b,, and a, in terms of € by requiring, as in Prob. 12, that d, = 0,j = 0, 
..-, 6, in (7.3-4). 
(b) For this approximation, calculate Cy and Do in (7.2-16) in terms of €. 
(c) Calculate d, in terms of €. 
(d) Calculate those values of € for which D, = 0. For each value of & calculate d,. Which 
value of € would be the best one to use? [Ref.: Kogbetliantz (1960), pp. 14, 25.] 


14 Reciprocal differences. Define the kth reciprocal difference of f(x) as 

$$\rho_k(x_0, x_1, \ldots, x_k) = \frac{x_0 - x_k}{\rho_{k-1}(x_0, \ldots, x_{k-1}) - \rho_{k-1}(x_1, \ldots, x_k)} + \rho_{k-2}(x_1, \ldots, x_{k-1}) \qquad k \ge 2$$

$$\rho_1(x_0, x_1) = \frac{x_0 - x_1}{f(x_0) - f(x_1)} \qquad \rho_0(x_0) = f(x_0)$$

(a) Show that if the two end arguments of $\rho_k$ are interchanged or if any two interior 
arguments are interchanged, the value of the reciprocal difference is unchanged. 

(b) Deduce then that if the value is unchanged if the first two arguments are interchanged, 
the value of a reciprocal difference is independent of the order of any of its arguments. [Recipro- 
cal differences are in fact symmetric in all their arguments, but the proof of this result is quite 
difficult; see Milne-Thomson (1933) for one such proof.] 


15 (a) Show how a reciprocal-difference table can be set up analogously to a finite- 
difference table. 

(b) Given values of In x at an interval of .1 from x = .3 to .9 (see Example 3.3), calculate 
the reciprocal-difference table. 


16 (a) Use the definition of reciprocal differences with $x_0 = x$ to derive the identity 

$$f(x) = f(x_1) + \cfrac{x - x_1}{\phi_1(x_1, x_2) + \cfrac{x - x_2}{\phi_2(x_1, x_2, x_3) + \cdots + \cfrac{x - x_n}{\rho_n(x, x_1, \ldots, x_n) - \rho_{n-2}(x_1, \ldots, x_{n-1})}}}$$

where $\phi_k(x_1, x_2, \ldots, x_{k+1}) = \rho_k(x_1, x_2, \ldots, x_{k+1}) - \rho_{k-2}(x_1, x_2, \ldots, x_{k-1})$ 


(b) Show that even if the last term on the right-hand side in this identity is deleted, the 
above is still an equality when $x = x_i$, i = 1, ..., n. This interpolation formula is called Thiele’s 
interpolation formula. 

(c) Use this interpolation formula and the table of Prob. 15b to estimate In .54 (cf. 
Example 3.2). Do this by computing successive convergents of the continued fraction, i.e., using 
first one reciprocal difference, then two, etc. 
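
A reciprocal-difference table and the evaluation of Thiele's interpolation formula can be sketched as follows. The function names and the zero-based indexing are assumptions made for illustration, and the sketch makes no attempt to guard against a vanishing denominator, which the recursion of Prob. 14 can produce for unlucky data.

```python
import math


def reciprocal_differences(xs, fs):
    """Return a table rho with rho[k][i] = rho_k(x_i, ..., x_{i+k})."""
    n = len(xs)
    rho = [list(fs)]
    for k in range(1, n):
        row = []
        for i in range(n - k):
            value = (xs[i] - xs[i + k]) / (rho[k - 1][i] - rho[k - 1][i + 1])
            if k >= 2:
                value += rho[k - 2][i + 1]
            row.append(value)
        rho.append(row)
    return rho


def thiele(xs, fs, x):
    """Evaluate Thiele's interpolating continued fraction at x."""
    rho = reciprocal_differences(xs, fs)
    n = len(xs)
    # Partial denominators: f at the first node, then rho_1, then rho_k - rho_{k-2}.
    phi = [rho[k][0] - (rho[k - 2][0] if k >= 2 else 0.0) for k in range(n)]
    value = phi[n - 1]
    for k in range(n - 2, -1, -1):     # work outward from the innermost term
        value = phi[k] + (x - xs[k]) / value
    return value


xs = [0.3 + 0.1 * i for i in range(7)]  # the tabular points of Prob. 15b
fs = [math.log(x) for x in xs]
print(thiele(xs, fs, 0.54), math.log(0.54))
```

Truncating the list of nodes (for example `thiele(xs[:3], fs[:3], 0.54)`) gives the successive convergents asked for in part (c).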


*17 Reciprocal derivatives. Define the jth reciprocal derivative of f(x) at a as 

$$R_j f(a) = \lim_{x_1, \ldots, x_{j+1} \to a} \rho_j(x_1, \ldots, x_{j+1})$$

(a) Show that $R_1 f(a) = 1/f'(a)$. 
(b) Using the definition of $\phi_j$ in Prob. 16, show that 

$$\lim_{x_1, \ldots, x_{j+1} \to a} \phi_j(x_1, \ldots, x_{j+1}) = R_j f(a) - R_{j-2} f(a) = \lim_{x_{j+1} \to a} \frac{x_{j+1} - a}{\rho_{j-1}(x_{j+1}, a, \ldots, a) - \rho_{j-1}(a, \ldots, a)} = \frac{j}{(\partial/\partial x)\rho_{j-1}(x, \ldots, x)\big|_{x=a}}$$

if the various limits exist. 
(c) Thus deduce that 

$$R_j f(a) - R_{j-2} f(a) = jR_1R_{j-1} f(a)$$

(d) Assuming that f(x) has reciprocal derivatives of all orders, use Prob. 16 to show that 
as n → ∞, 

$$f(x) = f(a) + \cfrac{x - a}{R_1 f(a) + \cfrac{x - a}{2R_1R_1 f(a) + \cfrac{x - a}{3R_1R_2 f(a) + \cdots}}}$$

This result, known as Thiele’s theorem, is the continued-fraction analog of Taylor’s theorem. 


*18 (a) Show that the reciprocal derivatives of f(x) can conveniently be calculated using 
the recurrence relation 

$$R_j(x) = R_{j-2}(x) + r_{j-1}(x) \qquad r_j(x) = \frac{j+1}{R_j'(x)} \qquad R_0(x) = f(x) \qquad R_{-1}(x) = 0$$

and that $r_{j-1}$ therefore becomes the jth denominator in Thiele’s theorem. 

(b) Use Thiele’s theorem to derive a continued-fraction expansion of ln (1 + x) about 
x = 0. Derive a general form for $r_j(0)$. Use the first four convergents of this result to estimate 
ln .54, and compare with the result of Prob. 16c. 

(c) Similarly, derive a continued-fraction expansion about x = 0 for $e^x$. Find the general 
form for $r_j(0)$. Use the first five convergents of this expansion to approximate e. 


*19 (a) Write Thiele’s interpolation formula as 


P(x) 
Q(x) 


where P(x) and Q(x) are the polynomials that result when the continued fraction with terms 
through that in @,_ , is converted to a rational function. By considering 


J (x) = y(x) + E(x) = + E(x) 


(z) 
F(z) = OES) — ye] — DOALL) — YO 
with p,(x) = (x — x,) °°: (x — x,) show that 
p,(x) d" 
E(x) = = 
x)= “9? (x) dx" POM. 
with ξ in the interval spanned by $x_1, \ldots, x_n$, and x. 
(b) By taking the limit as $x_i \to a$, i = 1, ..., n find the error incurred by truncating the 
continued-fraction expansion in Prob. 17d after the term in $(n - 1)R_1R_{n-2} f(a)$. 

(c) Deduce from the limiting form of the error term that the limiting form of the rational 
function in part (a) when a = 0 is identical to a Padé approximation with the same degree 
polynomials in numerator and denominator and thus find a general form for the error in a Padé 
approximation. [Ref.: Probs. 14 to 19, Milne-Thomson (1933).] 


Section 7.4 


20 (a) Derive each of the Padé approximations (7.4-5). 

(b) Use (7.3-6) to estimate the errors in the approximations (7.4-4) and (7.4-5). In each 
case determine how good these estimates are by comparing them with the values of the errors in 
Table 7.1. 

21 (a) Find all Padé approximations for (i) f(x) = sin x with N = 5; (ii) f(x) = cos x 
with N = 4. 

(b) Display an algorithm for the calculation of sin x on (−∞, ∞) by first showing how 
any value of x outside of [−(π/2), π/2] can be reduced to this interval. Then show how one of 
the approximations of part (a) can be applied with an argument less than π/4 in magnitude 
followed by an adjustment of the sign if necessary. Do the same for cos x on (−∞, ∞). What is 
the advantage of keeping the argument of the Padé approximation small? 

(c) Draw a graph of the error on [−(π/4), π/4] for each of the approximations of part (a). 

22 (a) Use the technique of Probs. 8 and 9 to express each of the approximations (7.4-4) 
and (7.4-5) in the form of a polynomial plus a continued fraction. 

(b) Use the results of Sec. 7.2 to determine how many multiplications and divisions are 
required to evaluate each of these five approximations (i) in continued-fraction form; (ii) in 
rational-function form. 

(c) Which approximation can be computed most rapidly if the division time is (i) the same 
as the multiplication time; (ii) 1½ times as great as the multiplication time; (iii) twice as great as 
the multiplication time? 


Section 7.5 


23 (a) By making an appropriate change of variable from [−1, 1] to [0, 1], derive the 
shifted Chebyshev polynomials $T_r^*(x)$ which are such that 

$$T_r^*(x) = \cos r\theta \qquad x = \cos^2 \frac{\theta}{2}$$

(b) Find a recurrence relation for the shifted Chebyshev polynomials and use it to generate 
$T_r^*(x)$, r = 1, ..., 6. 

(c) State and prove a theorem similar to Theorem 7.1 for the shifted Chebyshev 
polynomials. 


*24 (a) With the Chebyshev polynomial of degree i written as 

$$T_i(x) = \sum_{j=0}^{i} t_j^{(i)} x^j$$

show that the nonzero $t_j^{(i)}$ are given by 
4(i — j) 3(i — J) 


(b) Similarly, writing 


T#(x) = > ul?x! 
j=0 


Afi) Yow 


show that 


ul! — 92i-1 


[Ref.: Lanczos (1956), pp. 454-457. ] 


Section 7.6 


25 (a) Derive the identity (7.6-9). (b) Use this result to derive (7.6-10) and (7.6-11). 

26 (a) With f(x) = e*, use trigonometric interpolation to calculate y,(0) as given by 
(7.6-16) with L = 4. Compare this result with (7.6-4). 

(b) Convert the result of part (a) to a polynomial analogous to (7.6-5). Plot the error in 
this approximation on [—1, 1] and compare with the error in T,_ (x). 
*27 (a) Let $f(x) = \sum_{j=0}^{\infty} a_j x^j$ be the Maclaurin series for f(x) and let $T_i(x) = \sum_{j=0}^{i} t_j^{(i)} x^j$. 
By substituting these series in (7.6-2), show that 


Con = » RYPa,,; Cont = > RY? Q2j4+1 
j=0 j=0 
where 
(2j)! 2j +1 
Re” — — Vr tian) + tpn) 
27 27-1? 2j +2 
+ 1" (2) + 1)(2j + 3)-°: (Qj + 2n- 2-1 
(2j + 2)(2j + 4) --> (2) + 2n) 
Rintd) — (2)! rian ay 2h +1 cane 1y (2) + 1)(2) + 3) 
hay Lal O1) +2 % = (jf + 2/27 + 4) 


n fone) 2) + 1)(2) + 3) °°: (27+ 2n+ >| 
ont h (2j + 2)(2j + 4) --° (27 + 2n + 2) 


(b) Approximate the coefficients in (7.6-4) by truncating the series for c,, and c,,,, after 
the term in j = 3, and then convert the approximation to a power series. 
(c) As in Prob. 26b, compare this approximation with T, (x). [Ref.: Minnick (1957).] 


28 (a) Derive the Chebyshev expansions for sin x and cos x. By drawing a graph of the 
error on [−(π/4), π/4] compare the approximations obtained by truncating these expansions 
with the corresponding results of Prob. 21 [see Watson (1962, p. 21) for the integrals (7.6-2)]. 

(b) Calculate T, ,(x) for f(x) = sin x. Convert the result to rational-function form. Draw 
a graph of the error on [ —(7/4), 2/4] and compare the result with R, ,(x) derived in Prob. 21. 

(c) Calculate T, ,(x) for f(x) = cos x. Convert the result to rational-function form, draw 
a graph of the error on [— (7/4), 2/4], and compare with R, ,(x) derived in Prob. 21. 


29 For f(x) = e*, calculate T, ,(x) and T,. (x). Draw a graph of the error on [—1, 1] and 
compare with the corresponding Padé approximations. 


Section 7.7 


30 (a) When the Maclaurin expansion of f(x) contains only even powers or only odd 
powers, how should (7.7-3) be modified? 

(b) Use this result to calculate C, (x) for f(x) = sin x with « = 2/4. Draw a graph of the 
error on [ —(z/4), 2/4] and compare with the corresponding results of Probs. 21 and 28. 

(c) Calculate C, (x) for f(x) = cos x with a = 2/4. Draw a graph of the error on [ — (2/4), 
m/4] and compare with the corresponding results of Probs. 21 and 28. 


31 (a) Show that with C,,(x) defined as in (7.7-10) and (7.7-11), f(x) — C,,,(x) satisfies 
(7.7-7). 

(b) For f(x) = e* derive an approximation C, ,(x) with « = 1 as in Example 7.5, using as 
the sequence (7.7-8) (i) RY 9(x), RY o(x), RP o(x), RS o(x); (ii) RO o(x), Ry g(x), RYP2(x), RQ o(x). 
Why doesn’t the choice of the second and fourth rational functions make any difference? In 
each case compare the resulting approximation with T, ,(x), R2 (x), and the approximation of 
Example 7.5. 


32 (a) Calculate C, ,(x) and C, (x) for e* with a = 1. By drawing graphs of the errors 
on [—1, 1], compare the errors with the corresponding Padé and Chebyshev approximations. 
[Use any convenient sequence (7.7-8).] 

(b) Show that the derivation of C, ,(z) for sin (./]z])/,/]z] on [—(x2/16), 22/16] re- 
quires substantially less effort than the derivation of C, ,(x) for sin x on [—(/4), 1/4]. Is 
xC, (x?) equal to C, (x)? 
(c) Actually derive xC, ,(x²) with α = π/4. By drawing the graph of the error on [−(π/4), 
π/4], compare this approximation with corresponding Padé and Chebyshev approximations. 

(d) Similarly derive C, ,(x*) for cos x on [—(z/4), 2/4] and compare this approximation 
with corresponding Padé and Chebyshev approximations. 


*33 A common iterative method to compute $\sqrt{a}$ for a in (0, ∞) is (cf. Prob. 10, Chap. 2) 

$$x_{n+1} = \frac{1}{2}\left(x_n + \frac{a}{x_n}\right)$$

This iteration converges very rapidly if the initial approximation $x_0$ is sufficiently close to $\sqrt{a}$ 
(see Sec. 8.4). To get a good initial approximation, it is convenient to use a rational approximation. 
We begin by writing $a = 10^{2m} \times b$, where .01 ≤ b < 1. 

(a) Derive a Padé approximation R, ,(x) to ./x on [.01, 1] by first making a change of 
variable so that the new interval is centered at the origin. 

(b) Similarly, derive C, (x) with a = .495 using R{?9(x), Ryo(x), R(x), RE (x). By 
drawing a graph compare the errors in the approximations of parts (a) and (b). [Ref.: Kogbet- 
liantz (1960), pp. 33-35.] 
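
The dependence of this iteration on its starting value is easy to observe numerically. In the sketch below `crude_start` is a hypothetical stand-in for the rational starting approximation the problem asks for (it is simply the chord value (1 + b)/2, chosen here for illustration), and the step counts are likewise arbitrary.

```python
import math


def heron_sqrt(b, x0, steps):
    """Apply x_{n+1} = (x_n + b / x_n) / 2 a fixed number of times."""
    x = x0
    for _ in range(steps):
        x = 0.5 * (x + b / x)
    return x


def crude_start(b):
    """Hypothetical starting value; a rational approximation would replace this."""
    return 0.5 * (1.0 + b)


for b in (0.01, 0.25, 1.0):
    for steps in (2, 4, 8):
        approx = heron_sqrt(b, crude_start(b), steps)
        print(b, steps, abs(approx - math.sqrt(b)))
```

Near b = .01, where the crude starting value is poor, several iterations are spent before the rapid convergence sets in; a starting value from parts (a) or (b) is intended to remove exactly those wasted steps.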


34 (a) How must the development in Sec. 7.7 be modified if the interval [−α, α] is 
replaced by the interval [0, α]? 

(b) Use these modifications to derive an economized approximation C$ ,(x) to e* on 
[0, 1]. Draw a graph of the error. 


*35 The τ method. Consider the differential equation 

$$L(y) = p_n(x)\frac{d^n y}{dx^n} + p_{n-1}(x)\frac{d^{n-1} y}{dx^{n-1}} + \cdots + p_0(x)y + P(x) = 0$$

$$y(0) = y_0 \qquad y^{(i)}(0) = y_i \qquad i = 1, \ldots, n - 1$$

where $p_i(x)$ is a polynomial of degree $d_i$ and P(x) is a polynomial of degree d. 
(a) Assume a solution 

$$y(x) = \sum_{j=0}^{m} a_j x^j \qquad m > n$$

and let $D = \max(d, d_0 + m, d_1 + m - 1, \ldots, d_n + m - n)$ 


Show that substituting this solution into the differential equation leads in general to a system of 
D + 1 + n equations in m + 1 unknowns if the initial conditions are to be satisfied. 
(b) Show that, in general, the differential equation 


does have a solution of the form of part (a), where the $\tau_i$’s are real numbers and $T^*_{m-n+i}$ is the 
shifted Chebyshev polynomial of degree m − n + i (see Prob. 23). Thus deduce that if the $\tau_i$’s 
are small, the solution of this differential equation is a good approximation to the solution of 
L(y) = 0. 

(c) Use this method with m = 4 to approximate $e^x$. Draw a graph of the error on [0, 1] and 
compare this with the error in R4 (x) on (0, 1]. [This method is of particular value when f(x) 
does not have a convergent polynomial or continued-fraction expansion.] [Ref.: Lanczos (1956), 
pp. 464-469. ] 
Sections 7.8 and 7.9 


36 Let $P_m^*(x)/Q_k^*(x)$ be irreducible where $P_m^*(x)$ and $Q_k^*(x)$ are polynomials of degree m and 
k, respectively. Prove that any polynomial A(x) of degree m + k or less can be written 

$$A(x) = Q_k^*(x)a(x) - P_m^*(x)b(x)$$

where a(x) and b(x) are polynomials of degree less than or equal to m and k, respectively. 
*37 (a) Let r be the greatest lower bound of all $r_{mk}$ in (7.8-1) for given f(x), [a, b], m, and k. 
Why does an infinite sequence of rational approximations $R^{(i)}(x) = P^{(i)}(x)/Q^{(i)}(x)$ exist such 
that $r^{(i)} \to r$ as i → ∞? 
(b) Normalize $Q^{(i)}(x)$ so that $\sum_{j=0}^{k} (b_j^{(i)})^2 = 1$ and prove that for this normalization 
$\sum_{j=0}^{m} (a_j^{(i)})^2$ is bounded. Thus deduce that a subsequence $\{R^{(i_n)}(x)\}$ exists such that 

$$\lim_{n \to \infty} a_j^{(i_n)} = A_j \qquad \lim_{n \to \infty} b_j^{(i_n)} = B_j$$

for some constants $A_j$ and $B_j$. 
(c) Let 

$$R(x) = \frac{\sum_{j=0}^{m} A_j x^j}{\sum_{j=0}^{k} B_j x^j} \qquad \text{and} \qquad \bar{r} = \max_{a \le x \le b} |f(x) - R(x)|$$

Prove that $\{R^{(i_n)}(x)\}$ converges uniformly to R(x). 
(d) Use this result to show that $\bar{r} \le r$, and thus deduce the existence of the Chebyshev 
approximation. 

*38 Suppose there exist two rational approximations R“)(x) and R)(x) which minimize 
rime aS given in Eq. (7.8-1). Let L, and L, be, respectively, the number of alternating extrema of 
the errors in RU (x) and Rii(x). Assume L, > L, and let o,,..., «,, be the abscissas of the error 
extrema for R‘?)(x). 

(a) Define A(x) = RO}(x) — R2)(x). Show that 


sgn Lf («;) — RY2(a))] = sgn A(a,;) 


if A(a,) # 0. 

(b) Suppose $A(\alpha_{i-1}) \ne 0$, $A(\alpha_j) = 0$, j = i, ..., i + k, and $A(\alpha_{i+k+1}) \ne 0$. Show that the
number of zeros of A(x) (counting multiplicities) in $[\alpha_{i-1}, \alpha_{i+k+1}]$ is even if k is even and odd if k
is odd. Thus deduce that the number of zeros (counting multiplicities) in this interval is at least
k + 2.

(c) Thus deduce that A(x) has at least L, — 1 zeros in (a, b). 

(d) From this deduce a contradiction and, therefore, the uniqueness of the Chebyshev 
approximation. 

(e) Prove Theorem 7.3. 

[Ref.: Probs. 37 and 38, Achieser (1956), pp. 52-57.] 

39 (a) Show that the results of Sec. 7.8 and the previous three problems are unchanged if
we consider $s(x)R_{mk}(x)$ as the approximation to f(x), where s(x) is continuous and does not
vanish in (a, b).

(b) How do the algorithms of Sec. 7.9 need to be modified for approximations of the
form $s(x)R_{mk}(x)$, where s(x) is a continuous function which does not vanish in (a, b)?

40 Consider approximating $f(x) = x^2 - 1$ on [-1, 1] by a rational function with
m = k = 1. Show that the Chebyshev approximation of this type is a constant and thus deduce
that in Theorem 7.2 d = 1.


41 Calculate the Chebyshev approximation to $e^x$ on [-1, 1] of the form

$$\frac{a_0 + a_1 x}{1 + b_1 x}$$

by (a) calculating the Padé approximation $R_{11}(x)$; (b) calculating the economized approximation
$C_{11}(x)$; (c) calculating $R^*_{11}(x)$ using either of the algorithms of Sec. 7.9.


CHAPTER EIGHT


THE SOLUTION OF NONLINEAR EQUATIONS


8.1 INTRODUCTION 


The remaining three chapters of this book will be concerned with the solu- 
tion of linear and nonlinear equations and systems of equations. In this 
chapter we shall be concerned with the solution of nonlinear algebraic and 
transcendental equations. It might seem illogical to discuss the solution of 
nonlinear before linear equations, but the relation between the two subjects 
is tenuous indeed. And since the solution of simultaneous linear equations 
provides the gateway to many of the advanced topics in numerical analysis 
not discussed in this book, it seems reasonable to leave linear equations to 
the last. 

This chapter is divided naturally into two main sections which consider 
two not unrelated but mainly separate problems: (1) the search for real roots 
of the equation 


f(x) =0 (8.1-1) 


where x is a real variable and f(x) is any reasonably well-behaved function, 
and (2) the search for real and complex roots of 


P(x) =0 (8.1-2) 


where P(x) is a polynomial. This is not to say that we are never interested 
in complex roots of the general equation f(x) = 0, but this is a comparatively 
rare problem. 
Except in Sec. 8.8, we shall restrict ourselves to single equations. One
reason is that this case is more common than that of simultaneous nonlinear 
equations. An equally important reason, however, is that the solution of 
simultaneous nonlinear equations is a difficult problem about which much 
current research is centered. 

In the case of the solution of simultaneous linear equations, the problem 
is not to find a closed form for the solution but rather to find an efficient 
algorithm for computing the known solution. For the single nonlinear equa- 
tions of interest here, we assume no solution in closed form can be found, for 
if so, it would generally be easy to compute the solution, e.g., quadratic 
equations. Thus we must seek methods which lead to approximate solutions. 
Our technique throughout this chapter will be to develop iterative 
techniques for the solution of (8.1-1) and (8.1-2) (cf. Sec. 5.5-1). In consider- 
ing these iterative techniques, we shall wish to answer two basic questions: 
(1) Does the iteration converge? (2) If so, how fast does it converge? Perhaps 
surprisingly to the reader, the first of these questions will occupy us much 
less than the second in our consideration of (8.1-1). This is because the 
convergence question is in one sense very easy to answer and in another sense
too hard to answer. It is easy to answer because there is generally little
difficulty in showing that if the initial approximation(s) to a root $\alpha$ of (8.1-1)
are sufficiently close to the root, the iteration will converge to $\alpha$. (In a very few
cases the iteration will converge independent of the initial approximation.)
But the phrase “sufficiently close” points the way to the hard part of the
question. How close the initial approximation(s) have to be to $\alpha$ for convergence
depends generally upon the value of a derivative of f(x) at some
unknown point on the interval spanned by the initial approximation(s) and
$\alpha$. This presents the same problems we have found previously in the estimation
of error terms.

Moreover, in practical problems, enough a priori knowledge of the 
desired root of the equation is often known to ensure that convergence of the 
iterations is not a problem. When a priori knowledge is poor, it is often 
advisable to use a method which converges independent of the starting 
values (but, alas, usually slowly) until a good approximation is obtained and 
then to switch over to a more rapidly converging method. Thus we conclude 
that among those methods whose convergence depends upon the initial 
approximation, it is difficult to compare the convergence properties of var- 
ious methods and often it is not significant. 

However, it is both possible and important to compare the relative rates 
of convergence of various iteration methods. As we shall indicate, analogous 
to the case of the numerical solution of ordinary differential equations, the 
computational efficiency of a method depends upon the number of evalua- 
tions of f(x) and its derivatives. 

In the special case of polynomial equations, good a priori knowledge 
often is not available and, moreover, whereas often only a single root of 
f(x) = 0 is desired, all the roots of P(x) = 0 may be desired. We shall therefore
place rather more emphasis on methods whose convergence is assured
independent of the starting value in the case of polynomial equations than in 
the general case. 

It is worth noting that if an iterative method for the solution of (8.1-1) 
converges, the only limitation on the accuracy of the root is in the number of 
digits carried in the computation. That is, the roundoff error in a single 
iteration is the only inherent limitation on the accuracy. This seldom creates 
any problems in the solution of (8.1-1). 

The methods we now discuss all yield a single real root of (8.1-1). In cases
where we are interested in more than one root, once we have computed the
first m roots, $\alpha_1, \ldots, \alpha_m$, we can compute the (m + 1)st root, $\alpha_{m+1}$, by applying
any of these methods to the function

$$f_m(x) = \frac{f(x)}{\prod_{i=1}^{m} (x - \alpha_i)}$$


In addition to this method of implicit deflation or suppression, it is pos- 
sible to work with the original function f(x) if there exists some knowledge 
about the location of the roots we are seeking. In this case, it may turn out 
that an iterative method will converge to the root closest to an initial 
approximation, although there is no guarantee for this. The only time we can 
guarantee convergence to a root different from previously computed roots,
$\alpha_1, \ldots, \alpha_m$, is if we have an interval [a, b] such that f(a)f(b) < 0 and
$\alpha_i \notin [a, b]$, i = 1, ..., m. In this case, the methods of bisection (Sec. 2.2) and
false position and its modifications (Sec. 8.3) will find some root $\alpha_{m+1}$ in
[a, b].
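The implicit deflation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `secant` and `deflated`, the cubic test function, and the starting values are all hypothetical choices, and any other root finder from this chapter could be substituted for the secant step.

```python
def secant(func, x0, x1, tol=1e-12, max_iter=100):
    """Secant iteration (an assumed helper, not taken from the text)."""
    y0, y1 = func(x0), func(x1)
    for _ in range(max_iter):
        if y1 == y0:                      # chord is horizontal: stop
            break
        x2 = x1 - y1 * (x1 - x0) / (y1 - y0)
        x0, y0 = x1, y1
        x1, y1 = x2, func(x2)
        if abs(x1 - x0) <= tol * max(1.0, abs(x1)):
            break
    return x1


def deflated(func, roots):
    """Return f_m(x) = f(x) / prod_i (x - alpha_i) for the roots found so far."""
    def f_m(x):
        value = func(x)
        for r in roots:
            value /= (x - r)
        return value
    return f_m


def f(x):
    # Hypothetical test function with the simple roots 1, 2, 3.
    return x**3 - 6.0 * x**2 + 11.0 * x - 6.0


roots = []
for _ in range(3):
    # Same starting values every time; deflation steers the iteration
    # away from roots that have already been computed.
    roots.append(secant(deflated(f, roots), 0.0, 0.5))

print(sorted(roots))
```

Note that f itself is never modified; only the quotient is formed at each evaluation, which is why the deflation is called implicit.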


8.2 FUNCTIONAL ITERATION 


Let f(x) be a continuous real-valued function with as many derivatives as 
are required in what follows. Furthermore, we assume that in some neighborhood
of the desired root $\alpha$ of f(x) = 0 the function f(x) has an inverse;
i.e., we assume $\alpha$ is a simple root. Our approach here is to derive iterative
methods for the solution of f(x) = 0 by using inverse interpolation, a subject 
introduced briefly in Sec. 3.6. 

Let g(y) be the function inverse to f(x). Given the points $y_j$, j = 1, ..., n,
we can approximate g(y) using a Lagrangian interpolation formula as

$$g(y) \approx h(y) = \sum_{j=1}^{n} l_j(y)\, g(y_j) = \sum_{j=1}^{n} l_j(y)\, x_{i+1-j} \tag{8.2-1}$$

using, for later convenience, the notation $x_{i+1-j} = g(y_j)$. Our object is to
find a value $\alpha$ of x at which y = 0. Since $\alpha = g(0)$, we get from (8.2-1) an
approximation $x_{i+1}$ for $\alpha$ given by

$$x_{i+1} = h(0) = \sum_{j=1}^{n} l_j(0)\, x_{i+1-j} \tag{8.2-2}$$

where, using (3.2-3),

$$l_j(0) = \frac{(-1)^{n-1}\, y_1 y_2 \cdots y_{j-1} y_{j+1} \cdots y_n}{(y_j - y_1)(y_j - y_2) \cdots (y_j - y_{j-1})(y_j - y_{j+1}) \cdots (y_j - y_n)} \tag{8.2-3}$$

From (3.2-9)

$$\alpha - x_{i+1} = \frac{(-1)^n g^{(n)}(\eta)}{n!}\, y_1 y_2 \cdots y_n \tag{8.2-4}$$

with $\eta$ in the interval spanned by $y_1, \ldots, y_n$ and 0. The derivative of the
inverse function can be calculated in terms of derivatives of f(x) {1}.

Equation (8.2-2) defines an iterative method for finding a root of (8.1-1) if
the iteration converges. That is, using the points $x_i, \ldots, x_{i-n+1}$, we calculate
$x_{i+1}$, and then replacing $x_{i-n+1}$ by $x_{i+1}$, we calculate $x_{i+2}$ using the remaining
n points, and so on. This method is one example of an n-point iteration
function whose most general form is

$$x_{i+1} = F_i(x_i, x_{i-1}, \ldots, x_{i-n+1}) \tag{8.2-5}$$

where the iteration function $F_i$ will in general involve not only $x_i, \ldots, x_{i-n+1}$
but also values of f(x) and some of its derivatives evaluated at one or more
of the points $x_j$. For example, in (8.2-2) $l_j(0)$ involves values of f(x) at the
points $x_{i+1-j}$, j = 1, ..., n. We shall assume throughout this chapter that $F_i$
has as many continuous derivatives as required in a neighborhood of $\alpha$.
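As a rough illustration of the n-point iteration built on (8.2-2), the following Python sketch evaluates h(0) from the Lagrangian coefficients at y = 0 and then replaces the oldest point by the new iterate. The function names, the stopping rule, and the test function sin x - x/2 with three starting points are assumptions made for this example only.

```python
import math


def inverse_interpolation_step(xs, ys):
    """Evaluate h(0) = sum_j l_j(0) * x_j, where x is interpolated as a function of y."""
    total = 0.0
    for j, (xj, yj) in enumerate(zip(xs, ys)):
        lj0 = 1.0
        for k, yk in enumerate(ys):
            if k != j:
                lj0 *= (0.0 - yk) / (yj - yk)   # Lagrange factor evaluated at y = 0
        total += lj0 * xj
    return total


def inverse_interpolation_root(f, xs, tol=1e-12, max_iter=50):
    """n-point iteration: the newest iterate replaces the oldest point."""
    xs = list(xs)
    ys = [f(x) for x in xs]
    for _ in range(max_iter):
        if len(set(ys)) < len(ys):      # repeated ordinates: interpolation undefined
            break
        x_new = inverse_interpolation_step(xs, ys)
        y_new = f(x_new)
        xs = xs[1:] + [x_new]
        ys = ys[1:] + [y_new]
        if y_new == 0.0 or abs(xs[-1] - xs[-2]) <= tol:
            break
    return xs[-1]


def f(x):
    return math.sin(x) - x / 2.0


root = inverse_interpolation_root(f, [math.pi / 2.0, 2.5, math.pi])
print(root)
```

With n = 2 the same routine reduces to linear inverse interpolation through the last two points.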

Methods for finding the roots of (8.1-1) based on (8.2-5) are called 
functional iteration methods; all the methods we shall consider for the solu- 
tion of (8.1-1) are of this form. The subscript i on F; is necessary only when 
the iteration function itself may change from one iteration to the next; 
generally the iteration function will be stationary, i.e., independent of i, in 
which case we shall write F instead of F;. If the iteration using a stationary 
iteration function converges to a root $\alpha$, we must have

$$\alpha = F(\alpha, \alpha, \ldots, \alpha) \tag{8.2-6}$$

A concept which will be basic to our discussion of iteration methods is
that of the order of the method. First we define the error in the ith iterate to
be

$$\epsilon_i = \alpha - x_i \tag{8.2-7}$$

where the starting values are taken to be $x_{2-j}$, j = 1, ..., n. Now assume that
the iteration (8.2-5) converges so that $\lim_{i \to \infty} x_i = \alpha$. Then, if there exists a
real number $p \ge 1$ such that

$$\lim_{i \to \infty} \frac{|x_{i+1} - \alpha|}{|x_i - \alpha|^p} = \lim_{i \to \infty} \frac{|\epsilon_{i+1}|}{|\epsilon_i|^p} = C \ne 0 \tag{8.2-8}$$

we say that the method is of order p at $\alpha$ {2}. If p = 1, 2, or 3, convergence is
said to be linear, quadratic, or cubic, respectively. The constant C is called
the asymptotic error constant; it depends on f(x). The requirement that
$C \ne 0$ is to be interpreted to mean that $C \ne 0$ for a general function f(x).
This assures the uniqueness of p. For a particular function f(x), C may be 0,
in which case, for this function, the iteration will converge more rapidly than
usual. When p = 1, C must be less than or equal to 1 in order for the method
to converge {2}, but for p > 1, C need not be less than 1 for convergence. If
p = 1 and C < 1, we are assured of convergence if we start sufficiently close
to $\alpha$; this is not so for C = 1.
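The definition (8.2-8) also suggests a simple numerical check: for three successive errors, the ratio of log(|ε_{i+1}|/|ε_i|) to log(|ε_i|/|ε_{i-1}|) should approach p. The sketch below is illustrative only; the function name `observed_orders` and the Newton-Raphson test iteration for x² - 2 = 0 are assumptions, and the estimate is meaningful only near convergence and before roundoff dominates.

```python
import math


def observed_orders(iterates, alpha):
    """Estimate the order p from successive errors |alpha - x_i|."""
    errs = [abs(alpha - x) for x in iterates]
    orders = []
    for e0, e1, e2 in zip(errs, errs[1:], errs[2:]):
        if min(e0, e1, e2) > 0.0 and e1 != e0:
            orders.append(math.log(e2 / e1) / math.log(e1 / e0))
    return orders


# Hypothetical test iteration: x_{i+1} = x_i - (x_i^2 - 2) / (2 x_i).
x = 1.0
iterates = [x]
for _ in range(4):
    x = x - (x * x - 2.0) / (2.0 * x)
    iterates.append(x)

orders = observed_orders(iterates, math.sqrt(2.0))
print(orders)   # the last entries should be close to 2 (quadratic convergence)
```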

We assumed in the foregoing that $\alpha$ is a simple root. But we shall often
apply iteration methods derived using this assumption to functions which 
have multiple roots. We shall see that the order of the iteration will depend 
on the multiplicity of the root. 


8.2-1 Computational Efficiency 


We could compare methods of functional iteration on the basis of how fast
they converge, i.e., on their order, but it makes more sense to compare their
computational efficiency, which is a measure of how much computation must
be done to arrive at a given accuracy in a root. In order to arrive at a
definition of computational efficiency, let us consider trying to find a root of
(8.1-1) using two different iteration methods, both starting from the same
initial approximation. Let the two methods have orders $p_1$ and $p_2$, respectively.
For large enough i, that is, as we near convergence, the
approximations

$$|\epsilon^{(1)}_{i+1}| = C_1 |\epsilon^{(1)}_i|^{p_1} \qquad |\epsilon^{(2)}_{i+1}| = C_2 |\epsilon^{(2)}_i|^{p_2} \tag{8.2-9}$$

are valid, where $C_1$ and $C_2$ are the asymptotic error constants of the two
methods. Let

$$S_i = -\ln |\epsilon^{(1)}_i| \qquad T_i = -\ln |\epsilon^{(2)}_i| \tag{8.2-10}$$

Then

$$S_{i+1} = -\ln C_1 + p_1 S_i \qquad T_{i+1} = -\ln C_2 + p_2 T_i \tag{8.2-11}$$

The solutions of these difference equations are {3}

$$S_i = S_1 p_1^{i-1} - \ln\left(C_1^{(p_1^{i-1} - 1)/(p_1 - 1)}\right) \qquad T_i = T_1 p_2^{i-1} - \ln\left(C_2^{(p_2^{i-1} - 1)/(p_2 - 1)}\right) \tag{8.2-12}$$

where the initial values $S_1$ and $T_1$ are equal. Let the number of steps required
to get the desired degree of convergence be I and J, respectively, for methods
1 and 2. Hence $|\epsilon^{(1)}_{I+1}|$ and $|\epsilon^{(2)}_{J+1}|$, and therefore $S_{I+1}$ and $T_{J+1}$, are essentially
equal. Then using (8.2-12), we get

$$S_1 (p_1^I - p_2^J) + \ln \frac{C_2^{(p_2^J - 1)/(p_2 - 1)}}{C_1^{(p_1^I - 1)/(p_1 - 1)}} = 0 \tag{8.2-13}$$

If $\theta$ and $\phi$ are, respectively, the costs per iteration, i.e., the amount of computation
required per iteration, of the two methods, then the total costs of the
computation are $\theta I$ and $\phi J$, respectively. The quantities $\theta$ and $\phi$ can be
estimated from the iteration function, but (8.2-13) gives us no obvious way of
relating I and J. It is often a good assumption, however, that the second
term in (8.2-13) is small compared with the first (as will happen, for example,
if $C_1$ and $C_2$ are both close to unity). In this case we get

$$\frac{\theta I}{\theta / \ln p_1} \approx \frac{\phi J}{\phi / \ln p_2} \tag{8.2-14}$$

Therefore, a measure of the cost of a method is the product of $\theta$, the cost per
iteration, and the reciprocal of the logarithm of the order p. The efficiency
index, which is the reciprocal of the cost, may be defined as $(\ln p)/\theta$ or, to use
the more usual definition,

$$EI = p^{1/\theta} \tag{8.2-15}$$

A particular case in which (8.2-13) can be used directly to estimate I/J is 
considered in {11}. 
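A small Python sketch may make the two forms of the efficiency index concrete. The function names are hypothetical, and the iteration-count ratio uses the same simplifying assumption as (8.2-14), namely that the term involving the asymptotic error constants is negligible.

```python
import math


def efficiency_index(order, cost):
    """EI = p^(1/theta), as in (8.2-15)."""
    return order ** (1.0 / cost)


def log_efficiency(order, cost):
    """The alternative measure (ln p)/theta; it ranks methods identically."""
    return math.log(order) / cost


def iteration_ratio(p1, p2):
    """Approximate I/J from I ln p1 = J ln p2 (error constants neglected)."""
    return math.log(p2) / math.log(p1)


golden = (1.0 + math.sqrt(5.0)) / 2.0

# A method of order 2 costing two evaluations per step, compared with a
# method of order (1 + sqrt 5)/2 costing one evaluation per step.
print(efficiency_index(2.0, 2.0), efficiency_index(golden, 1.0))
print(iteration_ratio(golden, 2.0))   # steps of the second method per step of the first
```

Because p^(1/θ) is the exponential of (ln p)/θ, the two definitions always order a pair of methods in the same way.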

It is worth noting that since the order of a method is a property local to 
the neighborhood of a root, the efficiency index measures only how good a 
method is when it is near convergence. A determination of the efficiency 
of a method outside of the neighborhood of a root is generally extremely 
difficult. 

The cost per iteration will, by analogy with our argument in Chap. 5,
depend mainly on the number of evaluations of f(x) and its derivatives
required at each step and not on the arithmetic operations required to
combine these quantities in the iteration function $F_i$ of (8.2-5). We noted in
Sec. 5.9-1 that, having computed f(x), we can often compute f'(x) quite
cheaply on a digital computer. For example, if f(x) is composed of elementary
functions, the chief cost of evaluating f(x) is in the evaluation of these
elementary functions. Thus, since f'(x) will also be some combination of
these elementary functions, the evaluation of f'(x) is simple {4}. Another
example is the case where $f(x) = \int_a^x g(t)\,dt$; then f'(x) = g(x), which will
usually have been computed previously to get f(x).
8.3 THE SECANT METHOD


This is one of the oldest methods known for the solution of f(x) = 0, but it 
has been surprisingly neglected until recently, when its important advan- 
tages, particularly for use on computers, have again been realized. For this 
reason and because it serves to illustrate various aspects of iteration 
methods, we shall discuss it before considering some more general aspects of 
functional iteration. In this section and the following two we assume that $\alpha$ is
a simple root of (8.1-1).

Our approach in this section will be to use linear inverse interpolation,
that is, (8.2-2) with n = 2, to derive methods of the form (8.2-5). One such
method, illustrated in Fig. 8.1, is the method of false position or regula falsi.
Suppose we can find two points $x_1$ and $x_2$ such that $f(x_1)f(x_2) < 0$. The
chord joining $y_1$ and $y_2$ intersects the x axis at a point $x_3$. Then choosing $x_3$
and $x_i$, i = 1 or 2, such that $f(x_3)f(x_i) < 0$ (in the case shown i = 1), we
repeat the above procedure to obtain $x_4$ and so on. From a study of Fig. 8.1
it is clear that this method converges for any continuous function f(x); we
leave the details of the proof to a problem {5}. The method of bisection
discussed in Sec. 2.2 also converges for all continuous functions.
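The procedure just described can be sketched as follows. This is a minimal illustration, not a robust routine: the function name, the residual-based stopping test, and the test problem sin x - x/2 = 0 on [π/2, π] are assumptions of the example.

```python
import math


def false_position(f, a, b, tol=1e-10, max_iter=200):
    """Regula falsi: keep a bracket [a, b] with f(a) f(b) < 0 at every step."""
    ya, yb = f(a), f(b)
    if ya * yb >= 0.0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    x = a
    for _ in range(max_iter):
        # Intersection of the chord through (a, ya) and (b, yb) with the x axis.
        x = (ya * b - yb * a) / (ya - yb)
        y = f(x)
        if abs(y) <= tol:
            break
        # Retain the endpoint whose ordinate has the opposite sign to y.
        if y * ya < 0.0:
            b, yb = x, y
        else:
            a, ya = x, y
    return x


def f(x):
    return math.sin(x) - x / 2.0


root = false_position(f, math.pi / 2.0, math.pi)
print(root)
```

The stopping test uses |f(x)| rather than the bracket width because one endpoint of the bracket typically stays fixed, so the width need not tend to zero.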

The method of false position is a nonstationary iteration method in 
general. From Fig. 8.1 we have 


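$$\begin{aligned}
x_3 &= \frac{y_2}{y_2 - y_1}\, x_1 + \frac{y_1}{y_1 - y_2}\, x_2 \\
x_4 &= \frac{y_3}{y_3 - y_1}\, x_1 + \frac{y_1}{y_1 - y_3}\, x_3 \\
x_5 &= \frac{y_4}{y_4 - y_3}\, x_3 + \frac{y_3}{y_3 - y_4}\, x_4
\end{aligned} \tag{8.3-1}$$


Figure 8.1 The method of false position.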


Therefore, i= | ~ (8.3-2) 


oo Xi 

Yi — Yi-2 Yi-2— Si 
For certain functions, however, the method of false position is a stationary
method of functional iteration. For example, if f(x) is convex between $x_1$
and $x_2$, the method is stationary. From Fig. 8.2 we see that the point $x_1$ is
always one of the two points used to get the next iterate. Therefore, we have

$$x_{i+1} = \frac{y_i}{y_i - y_1}\, x_1 + \frac{y_1}{y_1 - y_i}\, x_i \tag{8.3-3}$$

for all i.
Since the method of false position is just a sequence of linear inverse
interpolations, the error in this case can be written down, using (8.2-4), as

$$\epsilon_{i+1} = \alpha - x_{i+1} = \tfrac{1}{2} g''(\eta)\, y_1 y_i = -\frac{f''(\xi)}{2[f'(\xi)]^3}\, y_1 y_i \tag{8.3-4}$$

since {1}

$$g''(y) = -\frac{f''(x)}{[f'(x)]^3} \tag{8.3-5}$$

Now since $f(\alpha) = 0$, we have, using the mean-value theorem,

$$y_1 = f(x_1) = f(x_1) - f(\alpha) = (x_1 - \alpha) f'(\xi_1) \tag{8.3-6}$$

and similarly

$$y_i = (x_i - \alpha) f'(\xi_i) \tag{8.3-7}$$

with $\xi_1$ and $\xi_i$ in appropriate intervals. Using these two equations in (8.3-4),
we have

$$\epsilon_{i+1} = -\frac{f''(\xi) f'(\xi_1) f'(\xi_i)}{2[f'(\xi)]^3}\, \epsilon_1 \epsilon_i \tag{8.3-8}$$

Figure 8.2 The method of false position for convex functions.

Therefore,

$$\lim_{i \to \infty} \frac{|\epsilon_{i+1}|}{|\epsilon_i|} = \left| \frac{f''(\bar\xi) f'(\xi_1) f'(\alpha)}{2[f'(\bar\xi)]^3} \right| |\epsilon_1|$$

since $\xi_i$ approaches $\alpha$ and $\xi$ approaches some limiting value $\bar\xi$ as $i \to \infty$. We
have assumed that f'(x) is bounded away from zero in a neighborhood of $\alpha$.
From this it follows that p = 1. Thus the method of false position has linear
convergence for convex functions. A similar situation holds for concave
functions. Since almost all functions are ultimately either concave or convex
in the neighborhood of a root, we see that the method of false position has
linear convergence for almost all functions.

The source of this poor convergence is that successive iterates lie on the
same side of the root. By an appropriate modification of the method we can
usually bring a new iterate over to the other side. Thus, suppose that $x_{i-1}$,
$x_i$, and $x_{i+1}$ are such that $y_{i-1} y_i < 0$ and $y_i y_{i+1} > 0$. In this situation, it is
desirable to modify regula falsi by choosing a different method to compute
$x_{i+2}$ using the bracketing points $x_{i-1}$ and $x_{i+1}$. The simplest approach is to
use bisection (see Sec. 2.2), taking $x_{i+2} = \tfrac{1}{2}(x_{i-1} + x_{i+1})$. Then if $x_{i+1}$ and
$x_{i+2}$ straddle the root, we proceed as in regula falsi. Otherwise we bisect
again until we get a pair $x_{i+j}$, $x_{i+j+1}$ which straddle the root.

A more efficient approach consists in applying the false-position formula
to the points $x_{i-1}$ and $x_{i+1}$ but with $y_{i-1}$ replaced by $\alpha y_{i-1} = \bar y_{i-1}$ for
some $\alpha$ such that $0 < \alpha < 1$. The improvement such a procedure can bring
about is obvious from Fig. 8.3. Our formula for $x_{i+2}$ becomes

$$x_{i+2} = \frac{\alpha y_{i-1}}{\alpha y_{i-1} - y_{i+1}}\, x_{i+1} + \frac{y_{i+1}}{y_{i+1} - \alpha y_{i-1}}\, x_{i-1} \tag{8.3-9}$$

$$\phantom{x_{i+2}} = \frac{\bar y_{i-1}}{\bar y_{i-1} - y_{i+1}}\, x_{i+1} + \frac{y_{i+1}}{y_{i+1} - \bar y_{i-1}}\, x_{i-1} \tag{8.3-10}$$

If $y_{i+2} y_{i+1} < 0$, we continue as in regula falsi. Otherwise we do another
modified step using a new $\alpha$ based on $y_{i+2}$ and $\bar y_{i-1}$.


Figure 8.3 The Illinois method for convex functions.
The simplest choice for $\alpha$ is $\alpha = \tfrac{1}{2}$ (called the Illinois method). It can be
shown that in this case, $\epsilon_{i+3} \approx k \epsilon_i^3$, so that the average order of the method is
$3^{1/3} = 1.442$. If we take $\alpha = y_i/(y_i + y_{i+1})$ (called the Pegasus method), the
average order increases to $(7.275)^{1/4} = 1.642$. A further refinement consists
of taking

$$\alpha = \begin{cases} \beta & \beta > 0 \\ \tfrac{1}{2} & \beta \le 0 \end{cases} \qquad \beta = \frac{f[x_{i+1}, x_i]}{f[x_i, x_{i-1}]}$$

in which case the average order of the method is at least 1.68. Here, $f[x_j, x_k]$
is the divided difference $(y_j - y_k)/(x_j - x_k)$.
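The Illinois and Pegasus modifications differ from plain regula falsi only in how the retained ordinate is scaled when two successive iterates fall on the same side of the root. The sketch below is one common way to organize this; the function name, the `variant` argument, the stopping rule, and the test problem are assumptions of the example rather than anything prescribed above.

```python
import math


def modified_false_position(f, a, b, variant="illinois", tol=1e-12, max_iter=100):
    """Regula falsi with the retained ordinate scaled by a factor in (0, 1)."""
    ya, yb = f(a), f(b)          # (a, ya): retained endpoint; (b, yb): latest iterate
    if ya * yb >= 0.0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    for _ in range(max_iter):
        x = (ya * b - yb * a) / (ya - yb)
        y = f(x)
        if y == 0.0 or abs(x - b) <= tol:
            return x
        if y * yb < 0.0:
            # The new iterate crossed the root: ordinary false-position step,
            # and the previous iterate becomes the retained endpoint.
            a, ya = b, yb
        elif variant == "illinois":
            ya *= 0.5                     # scaling factor 1/2
        else:
            ya *= yb / (yb + y)           # Pegasus factor y_i / (y_i + y_{i+1})
        b, yb = x, y
    return b


def f(x):
    return math.sin(x) - x / 2.0


root_illinois = modified_false_position(f, math.pi / 2.0, math.pi, "illinois")
root_pegasus = modified_false_position(f, math.pi / 2.0, math.pi, "pegasus")
print(root_illinois, root_pegasus)
```

Scaling only changes the ordinate used in the chord formula, never its sign, so the pair (a, b) continues to bracket the root.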

These modified versions of regula falsi are excellent methods when we 
have two points at which f(x) has opposite signs. If we do not wish to 
expend the effort to find such points, we can use linear inverse interpolation 
through the last two computed points to generate the next point in the 
sequence. Thus, using $x_i$ and $x_{i-1}$ to generate $x_{i+1}$, we have

$$x_{i+1} = \frac{y_i}{y_i - y_{i-1}}\, x_{i-1} + \frac{y_{i-1}}{y_{i-1} - y_i}\, x_i \tag{8.3-11}$$

This stationary iteration method is called the secant method.
For (8.3-11) the analogous equation to (8.3-8) is

$$\epsilon_{i+1} = -\frac{f''(\xi) f'(\xi_i) f'(\xi_{i-1})}{2[f'(\xi)]^3}\, \epsilon_i \epsilon_{i-1} \tag{8.3-12}$$

If the initial approximations $x_0$ and $x_1$ are sufficiently close to $\alpha$, then, since
$f'(\xi)$ is bounded in some neighborhood of $\alpha$, the iteration will converge to $\alpha$.

Assume now that the iteration does converge so that all iterates are contained
in some interval I. On this interval let

$$0 < m_1 \le |f'(x)| \le M_1 \qquad |f''(x)| \le M_2 \tag{8.3-13}$$

Then from (8.3-12)

$$|\epsilon_{i+1}| \le K |\epsilon_i| |\epsilon_{i-1}| \tag{8.3-14}$$

where $K = M_2 M_1^2 / 2 m_1^3$. Now let $K |\epsilon_i| = d_i$. Then we can write (8.3-14) as

$$d_{i+1} \le d_i d_{i-1} \tag{8.3-15}$$

Now let “sufficiently close” above mean that $d_0$ and $d_1$ are both less than or
equal to d < 1. Then from (8.3-15)

$$d_2 \le d^2 \qquad d_3 \le d^3 \qquad d_4 \le d^5 \qquad d_5 \le d^8 \tag{8.3-16}$$

and in general

$$d_i \le d^{\gamma_i} \tag{8.3-17}$$

where

$$\gamma_{i+1} = \gamma_i + \gamma_{i-1} \qquad i = 1, 2, \ldots \qquad \gamma_0 = \gamma_1 = 1 \tag{8.3-18}$$


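Equation (8.3-18) is the difference equation for the Fibonacci numbers. The
indicial equation for (8.3-18) is

$$p^2 = p + 1 \tag{8.3-19}$$

which has the solutions $(1 + \sqrt{5})/2$ and $(1 - \sqrt{5})/2$. Therefore, the solution
of (8.3-18) is {6}

$$\gamma_i = \frac{1}{\sqrt{5}} \left[ \left( \frac{1 + \sqrt{5}}{2} \right)^{i+1} - \left( \frac{1 - \sqrt{5}}{2} \right)^{i+1} \right] \tag{8.3-20}$$

Furthermore,

$$\lim_{i \to \infty} \frac{\gamma_{i+1}}{\gamma_i} = \frac{1 + \sqrt{5}}{2} \approx 1.618 \tag{8.3-21}$$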


Now from (8.3-17) and the definition of $d_i$ we have

$$|\epsilon_i| \le \frac{1}{K}\, d^{\gamma_i} \tag{8.3-22}$$

This together with (8.3-21) suggests the result that {6}

$$\lim_{i \to \infty} \frac{|\epsilon_{i+1}|}{|\epsilon_i|^{(1 + \sqrt{5})/2}} = \left| \frac{f''(\alpha)}{2 f'(\alpha)} \right|^{(\sqrt{5} - 1)/2} \tag{8.3-23}$$

since as $x_i$ approaches $\alpha$, the coefficient of $\epsilon_i \epsilon_{i-1}$ in (8.3-12) approaches
$|f''(\alpha)/2f'(\alpha)|$. The reader will have noted that the foregoing by no means
constitutes a proof of (8.3-23). A rigorous proof of (8.3-23) is beyond our
scope here but will be found in Ostrowski (1973). From (8.3-23) it follows
that the order of the secant method is $(1 + \sqrt{5})/2$ and that†

$$EI = \frac{1 + \sqrt{5}}{2} \tag{8.3-24}$$
Not only is this substantially better than regula falsi (EI = 1), but, as we 
shall see, the efficiency of the secant method compares favorably with many 
seemingly more sophisticated methods. 
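The exponents $\gamma_i$ of (8.3-18) and their closed form (8.3-20) are easy to check numerically. The short Python sketch below does so; the function names are hypothetical and the comparison is limited to small i, where floating-point rounding of the closed form is harmless.

```python
import math


def gamma_sequence(n):
    """gamma_0 .. gamma_n from gamma_{i+1} = gamma_i + gamma_{i-1}, gamma_0 = gamma_1 = 1."""
    g = [1, 1]
    for _ in range(n - 1):
        g.append(g[-1] + g[-2])
    return g


def gamma_closed_form(i):
    """Closed form (8.3-20)."""
    s5 = math.sqrt(5.0)
    return (((1.0 + s5) / 2.0) ** (i + 1) - ((1.0 - s5) / 2.0) ** (i + 1)) / s5


g = gamma_sequence(20)
print(g[:6])                  # exponents appearing in (8.3-16): 1, 1, 2, 3, 5, 8
print(g[20] / g[19])          # ratio of successive exponents, near 1.618
```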
The fact that the secant method does not require the evaluation of any 
derivatives can be a great advantage in certain problems. For example, a not 
uncommon problem is to find a root of 


$$f(x, y_1, y_2, \ldots, y_k) = 0 \tag{8.3-25}$$

where

$$y_j = g_j(x) \qquad j = 1, \ldots, k \tag{8.3-26}$$


Evaluating derivatives of f with respect to x is often impractical in such a 
case. 

A disadvantage of the secant method would seem to be that multiple-precision
arithmetic would be required as the iteration nears convergence
because (8.3-11) then involves the difference of two nearly equal quantities $y_i$
and $y_{i-1}$. But let us rewrite (8.3-11) as

$$x_{i+1} = x_i - \frac{x_i - x_{i-1}}{y_i - y_{i-1}}\, y_i \tag{8.3-27}$$

The second term can be considered a correction term to $x_i$ and as such
requires only a very few significant digits as convergence is neared (why?).
Therefore, although the quotient $(x_i - x_{i-1})/(y_i - y_{i-1})$ will have very few
significant digits if multiple-precision arithmetic is not used, there will nevertheless
generally be enough significant figures to compute $\alpha$ to nearly full
single-precision accuracy.

† Here and hereafter in this chapter we shall estimate the cost of evaluating the iteration
function F by considering only the evaluations of f(x) and its derivatives.
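A minimal Python sketch of the secant method in the correction form (8.3-27) follows. The function name, the stopping rule based on the size of the correction, and the choice of sin x - x/2 with starting values π/2 and π are assumptions of this example.

```python
import math


def secant(f, x0, x1, tol=1e-10, max_iter=50):
    """Secant iteration written as x_{i+1} = x_i - correction, as in (8.3-27)."""
    y0, y1 = f(x0), f(x1)
    iterates = [x0, x1]
    for _ in range(max_iter):
        if y1 == y0:                       # quotient undefined: stop
            break
        correction = y1 * (x1 - x0) / (y1 - y0)
        x0, y0 = x1, y1
        x1 = x1 - correction
        y1 = f(x1)                         # one new function evaluation per step
        iterates.append(x1)
        if abs(correction) <= tol:
            break
    return x1, iterates


def f(x):
    return math.sin(x) - x / 2.0


root, iterates = secant(f, math.pi / 2.0, math.pi)
print(root, len(iterates) - 2)             # root and number of secant steps
```

Only one evaluation of f is needed per step, which is the basis for taking the cost per iteration to be 1 in (8.3-24).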


Example 8.1 Find the positive root of sin x - x/2 = 0 using the secant method, the
method of false position, and some of its modifications.

Since the root lies between π/2 and π (see Fig. 8.4), we use $x_0 = \pi/2$, $x_1 = \pi$. Using
(8.3-11), (8.3-3), and (8.3-10) with $\alpha = \tfrac{1}{2}$ and with $\alpha = y_i/(y_i + y_{i+1})$, we get (note that
f''(x) < 0 in [π/2, π] so that f(x) is concave)


          Secant     Method of         Illinois   Pegasus
          method     false position    method     method
x_2       1.75960    1.75960           1.75960    1.75960
x_3       1.93200    1.84420           1.91904    1.99169
x_4       1.89242    1.87701           1.89349    1.88772
x_5       1.89543    1.88895           1.89547    1.89509
x_6       1.89549    1.89320           1.89552    1.90226
x_7                  1.89469           1.89549    1.89549
x_8                  1.89521
x_9                  1.89540
x_10                 1.89546
x_11                 1.89548
x_12                 1.89549


Figure 8.4 Graph of f(x) = sin x - x/2.
As we would expect, the secant method converges substantially faster to the true value
1.89549 than the method of false position. On the other hand, both the Illinois and
Pegasus methods perform almost as well as the secant method while retaining the bracketing
property of the method of false position. Note also the slow convergence of the method
of false position as the root is approached.

While the method of false position does not depend on the ordering of the initial
values, all other methods illustrated above do. Thus, if we take $x_0 = \pi$, $x_1 = \pi/2$, we get the
following results:


          Secant     Illinois   Pegasus
          method     method     method
x_2       1.75960    1.75960    1.75960
x_3       1.84420    1.84420    1.84420
x_4       1.90011    1.90820    1.95258
x_5       1.89535    1.89511    1.89381
x_6       1.89549    1.89549    1.89544
x_7                             1.89708
x_8                             1.89549


The switch improved the performance of the Illinois method while the convergence of the
Pegasus method slowed down slightly.


8.4 ONE-POINT ITERATION FORMULAS 


Equation (8.3-11) is a two-point iteration method; i.e., to compute $x_{i+1}$ we
use information at two previous values. Formulas of the class which use
information at only one point are naturally called one-point iteration formulas.
In this section we shall consider only stationary one-point iteration
formulas which have the form

$$x_{i+1} = F(x_i) \tag{8.4-1}$$

with $\alpha = F(\alpha)$ if the method converges. We first prove a theorem:


Theorem 8.1 Assume that F(x) has sufficiently many derivatives. The
order of any one-point iteration function F(x) is a positive integer. More
specifically F(x) has order p if and only if $F(\alpha) = \alpha$; $F^{(j)}(\alpha) = 0$,
$1 \le j < p$; $F^{(p)}(\alpha) \ne 0$.


PROOF We expand $F(x_i)$ in a Taylor series about $\alpha$

$$F(x_i) = \alpha + (x_i - \alpha) F'(\alpha) + \cdots + \frac{(x_i - \alpha)^{p-1}}{(p-1)!} F^{(p-1)}(\alpha) + \frac{(x_i - \alpha)^p}{p!} F^{(p)}(\xi) \tag{8.4-2}$$

where $\xi$ lies between $x_i$ and $\alpha$. Since $F(x_i) = x_{i+1}$, we have

$$x_{i+1} - \alpha = \frac{(x_i - \alpha)^p}{p!} F^{(p)}(\xi) \tag{8.4-3}$$

Therefore,

$$\lim_{i \to \infty} \frac{|x_{i+1} - \alpha|}{|x_i - \alpha|^p} = \frac{1}{p!} |F^{(p)}(\alpha)| \ne 0 \tag{8.4-4}$$

if the iteration converges; this proves the “if” part of the theorem. On
the other hand, it follows easily from (8.4-2) and (8.4-4) that if $F^{(j)}(\alpha) \ne 0$
for some j between 0 and p or if $F^{(p)}(\alpha) = 0$, then F(x) cannot be of
order p.


In this section we shall develop a particular class of one-point iteration 
functions which contains members of all integral orders. Although this class 
by no means exhausts the class of one-point iteration functions, there does 
exist a relationship between this class and all other one-point iteration 
functions which is, however, beyond our scope here. 

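We assume, as in Sec. 8.2, that in a neighborhood of a root $\alpha$ of (8.1-1),
f(x) has an inverse g(y). The Taylor-series expansion of g(y) about a point $y_i$
is given by

$$\begin{aligned}
x = g(y) &= \sum_{j=0}^{m+1} \frac{(y - y_i)^j}{j!}\, g^{(j)}(y_i) + \frac{(y - y_i)^{m+2}}{(m+2)!}\, g^{(m+2)}(\eta) \\
&= x_i + \sum_{j=1}^{m+1} \frac{(y - y_i)^j}{j!}\, g^{(j)}(y_i) + \frac{(y - y_i)^{m+2}}{(m+2)!}\, g^{(m+2)}(\eta)
\end{aligned} \tag{8.4-5}$$

where $\eta$ is between y and $y_i$. Since $\alpha = g(0)$, we have

$$\begin{aligned}
\alpha &= x_i + \sum_{j=1}^{m+1} \frac{(-1)^j y_i^j}{j!}\, g^{(j)}(y_i) + \frac{(-1)^{m+2} y_i^{m+2}}{(m+2)!}\, g^{(m+2)}(\eta) \\
&= x_i + \sum_{j=1}^{m+1} \frac{(-1)^j}{j!}\, f_i^j g_i^{(j)} + \frac{(-1)^{m+2}}{(m+2)!}\, f_i^{m+2} g^{(m+2)}(\eta)
\end{aligned} \tag{8.4-6}$$

where we have written $y_i = f(x_i) = f_i$ and $g^{(j)}(y_i) = g_i^{(j)}$.
Equation (8.4-6) suggests consideration of the iteration formula

$$x_{i+1} = x_i + \sum_{j=1}^{m+1} \frac{(-1)^j}{j!}\, f_i^j g_i^{(j)} \tag{8.4-7}$$

The value of (8.4-7) will depend partly on whether the $g_i^{(j)}$ can easily be
calculated. Clearly

$$g_i^{(1)} = \frac{1}{f_i'}$$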
For j > 1, we can compute the $g_i^{(j)}$ in terms of derivatives of f(x) at $x = x_i$ as
follows. If we define the operator D = d/dy, then

$$D = \frac{1}{f'} \frac{d}{dx}$$

Hence

$$D^j g = D^{j-1} g' = \left( \frac{1}{f'} \frac{d}{dx} \right)^{j-1} \frac{1}{f'}$$

so that we have expressed $g^{(j)}(y)$ in terms of derivatives of f(x). The actual
computation of $D^j g$ is given by the formula

$$D^j g = \frac{F_j}{(f')^{2j-1}} \qquad j = 1, 2, 3, \ldots \tag{8.4-8}$$

where the $F_j$ are defined by the recursion

$$F_1 = 1 \qquad F_{j+1} = f' F_j' - (2j - 1) f'' F_j \tag{8.4-9}$$

which can be easily proved by induction {8}. Thus, the main problem in
evaluating (8.4-7) is to evaluate $f_i$, $f_i'$, and $F_j$, j = 2, ..., m + 1.
Subtracting (8.4-7) from (8.4-6), we have

$$\epsilon_{i+1} = \frac{(-1)^{m+2}}{(m+2)!}\, f_i^{m+2} g^{(m+2)}(\eta) \tag{8.4-10}$$

Since

$$f_i = f(x_i) = f(x_i) - f(\alpha) = (x_i - \alpha) f'(\xi_i) \tag{8.4-11}$$

with $\xi_i$ between $x_i$ and $\alpha$, we have

$$\epsilon_{i+1} = \frac{1}{(m+2)!} \left\{ [f'(\xi_i)]^{m+2} g^{(m+2)}(\eta) \right\} \epsilon_i^{m+2} \tag{8.4-12}$$

If the root is simple, the term in braces in (8.4-12) is bounded in some
neighborhood of $\alpha$. Therefore, the order of (8.4-7) is m + 2, and if the initial
approximation is sufficiently good, the iteration will converge. (Even for a
bad initial approximation it may converge; see Sec. 8.7.)
The evaluation of (8.4-7) requires the evaluation of $f(x_i)$ and its first
m + 1 derivatives (why?). If $\theta_j$ is the cost of evaluating $f^{(j)}(x_i)$ relative to the
cost of evaluating f(x), which as before we take to be 1, the efficiency index
of (8.4-7) is given by

$$EI = (m + 2)^{1/w} \qquad \text{where } w = 1 + \sum_{j=1}^{m+1} \theta_j \tag{8.4-13}$$
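As an illustration of (8.4-7) beyond the lowest order, the case m = 1 uses $g' = 1/f'$ and $g'' = -f''/(f')^3$ from (8.4-8) and (8.4-9) and gives a third-order formula. The Python sketch below is only a hedged example: the function names, the fixed number of steps, and the test function sin x - x/2 started at x = 2 are assumptions.

```python
import math


def one_point_step_m1(x, f, df, d2f):
    """One step of (8.4-7) with m = 1: x - f g' + (1/2) f^2 g''."""
    fx, f1, f2 = f(x), df(x), d2f(x)
    g1 = 1.0 / f1                 # first derivative of the inverse function
    g2 = -f2 / f1**3              # second derivative of the inverse function
    return x - fx * g1 + 0.5 * fx**2 * g2


def f(x):
    return math.sin(x) - x / 2.0


def df(x):
    return math.cos(x) - 0.5


def d2f(x):
    return -math.sin(x)


x = 2.0
for _ in range(5):
    x = one_point_step_m1(x, f, df, d2f)
print(x)
```

Each step costs one evaluation of f, f', and f'', so by (8.4-13) the efficiency index of this case is $3^{1/w}$ with $w = 1 + \theta_1 + \theta_2$.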


One important and familiar special case of (8.4-7) is that for m = 0. We have

$$x_{i+1} = x_i - \frac{f_i}{f_i'} \tag{8.4-14}$$

which is the familiar Newton-Raphson iteration. The error is given by
(8.4-12) as

$$\epsilon_{i+1} = \tfrac{1}{2} [f'(\xi_i)]^2 g''(\eta)\, \epsilon_i^2 = -\frac{f''(\xi) [f'(\xi_i)]^2}{2 [f'(\xi)]^3}\, \epsilon_i^2 \tag{8.4-15}$$

In fact it can be shown {10} that the terms in f'(x) can be canceled as if
$\xi = \xi_i = x_i$, so that

$$\epsilon_{i+1} = -\frac{1}{2} \frac{f''(\xi)}{f'(x_i)}\, \epsilon_i^2 \tag{8.4-16}$$

From (8.4-13) the efficiency index of the Newton-Raphson method is
$2^{1/(1 + \theta_1)}$, where $\theta_1$ is the cost of evaluating f'(x). A straightforward calculation
leads to the result that if $\theta_1 < .44$, the efficiency of the Newton-Raphson
method is greater than that of the secant method (8.3-11). As we pointed out
previously, the cost of evaluating the derivative is often much less than that
of evaluating the function, a notable exception being polynomials. If a decision
is to be made whether to use (8.3-11) or (8.4-14) to solve (8.1-1), then a
perfectly reasonable basis for this decision is to estimate $\theta_1$ and use (8.3-11) if
it is greater than .44 and (8.4-14) otherwise.
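The threshold of .44 quoted above can be reproduced by equating the two efficiency indices, $2^{1/(1+\theta_1)} = (1 + \sqrt{5})/2$, and solving for $\theta_1$. The following Python sketch does this; the function name `newton_ei` and the two sample costs are illustrative assumptions.

```python
import math

golden = (1.0 + math.sqrt(5.0)) / 2.0      # EI of the secant method, (8.3-24)


def newton_ei(theta1):
    """EI of Newton-Raphson when f' costs theta1 relative to f, from (8.4-13)."""
    return 2.0 ** (1.0 / (1.0 + theta1))


# Break-even cost: 2^(1/(1+theta1)) = golden  =>  theta1 = ln 2 / ln golden - 1.
threshold = math.log(2.0) / math.log(golden) - 1.0
print(threshold)

for theta1 in (0.3, 0.6):
    better = "Newton-Raphson" if newton_ei(theta1) > golden else "secant"
    print(theta1, newton_ei(theta1), better)
```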


Example 8.2 Repeat the calculation of Example 8.1 using the Newton-Raphson method
starting with, first, $x_1 = \pi$ and, second, $x_1 = \pi/2$.
The results are


          x_1 = π    x_1 = π/2
x_2       2.09440    2.00000
x_3       1.91322    1.90100
x_4       1.89567    1.89551
x_5       1.89549    1.89549


The convergence in both cases requires 1 less iteration than the secant method. Since
f'(x) = cos x - ½, the derivative is somewhat, but not a great deal, easier to compute than
the function (because to compute the cosine from the sine a square root will have to be
calculated). But if, for example, we had $f(x) = e^x - x$, the derivative would be much easier
to calculate than the function.
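The Newton-Raphson calculation of this example can be sketched in Python as follows. The function name and the stopping rule are assumptions of the sketch; the two starting values are π and π/2.

```python
import math


def newton(f, df, x, tol=1e-10, max_iter=50):
    """Newton-Raphson iteration (8.4-14); returns the root and all iterates."""
    iterates = [x]
    for _ in range(max_iter):
        step = f(x) / df(x)
        x = x - step
        iterates.append(x)
        if abs(step) <= tol:
            break
    return x, iterates


def f(x):
    return math.sin(x) - x / 2.0


def df(x):
    return math.cos(x) - 0.5


root_a, its_a = newton(f, df, math.pi)
root_b, its_b = newton(f, df, math.pi / 2.0)
print([round(v, 5) for v in its_a[1:5]])
print([round(v, 5) for v in its_b[1:5]])
```

Each step here requires one evaluation of f and one of f', which is what enters the efficiency comparison with the secant method.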


8.5 MULTIPOINT ITERATION FORMULAS†


In this section we shall consider some examples of stationary multipoint
iteration functions. Such iteration functions have the form (8.2-5) with n > 1.
Of the various possible approaches to the derivation of multipoint iteration
functions, we shall consider two in this section, the first because of its general
theoretical interest and the second because it leads to some particularly
interesting and useful formulas. Again we shall assume that $\alpha$ is a simple
root.

† Traub (1964, chap. 8) uses “multipoint” in a different context. He calls the iteration
functions of this section “one-point iteration functions with memory.”


8.5-1 Iteration Formulas Using General Inverse Interpolation 


In Sec. 8.2 we used inverse Lagrangian interpolation to derive a class of
methods of functional iteration. Here we shall generalize this technique by
using a general polynomial interpolation formula which has the property
that, at a point $x_{i+1-j}$, the interpolation polynomial $y(x)$ and its first $r_j$
derivatives agree with $f(x)$. By analogy with (3.2-9), using the points
$x_{i+1-j}$, $j = 1, \ldots, n$, and assuming that $f$ has as many continuous derivatives
as we desire, we may write {17}

$$f(x) = y(x) + \frac{f^{(\beta+1)}(\xi)}{(\beta+1)!}\prod_{j=1}^{n}(x - x_{i+1-j})^{r_j+1} \qquad \beta = n - 1 + \sum_{j=1}^{n} r_j \tag{8.5-1}$$

where $\xi$ lies in the interval spanned by $x_i, \ldots, x_{i-n+1}$, and $x$. For a direct
application of this formula to functional iteration, see {19 to 21}. Our interest
here, however, is in interpolating using the inverse function $g$. Assuming
this function exists and has as many continuous derivatives as we desire,
then corresponding to (8.2-1) we have

$$g(y) = q(y) + \frac{g^{(\beta+1)}(\eta)}{(\beta+1)!}\prod_{j=1}^{n}(y - y_j)^{r_j+1} \tag{8.5-2}$$

where $y_j = f(x_{i+1-j})$ and $\eta$ lies in the interval spanned by $y_1, \ldots, y_n$ and $y$.
Since $\alpha = g(0)$, we get, by analogy with (8.3-2), an iteration formula given by
$x_{i+1} = q(0)$, where $q(0)$ is a linear combination of $x_{i+1-j} = g(y_j)$,
$j = 1, \ldots, n$, and derivatives of $g$ evaluated at $y_j$, $j = 1, \ldots, n$. For example, if
$r_j = 1$ for all $j$, then from the Hermite interpolation formula we get

$$x_{i+1} = \sum_{j=1}^{n} h_j(0)\,x_{i+1-j} + \sum_{j=1}^{n} \bar h_j(0)\,g'(y_j) \tag{8.5-3}$$


This formulation of multipoint iteration formulas includes as a special 
case the one-point iteration formulas of the previous section. The Taylor- 
series expansion (8.4-5) is identical with the generalized interpolation for- 
mula when n=1; that is, only one point is used. Note that the 
Newton-Raphson method is given by (8.5-3) with n= 1. Almost all the 
well-known methods of stationary functional iteration are derivable as 
special cases of this general approach. Besides the Newton-Raphson method 
another example of this that we have seen is the secant method, which is an 
application of linear inverse Lagrangian interpolation.

Of particular interest to us here is the order of an iteration formula
derived from (8.5-2). Analogous to (8.2-4) we have

$$\alpha - x_{i+1} = (-1)^{\beta+1}\,\frac{g^{(\beta+1)}(\eta)}{(\beta+1)!}\prod_{j=1}^{n} y_j^{\,r_j+1} \tag{8.5-4}$$

Now since, by the mean-value theorem,

$$y_j = f(x_{i+1-j}) = f(x_{i+1-j}) - f(\alpha) = -f'(\xi_j)\,\epsilon_{i+1-j} \tag{8.5-5}$$

with $\xi_j$ between $x_{i+1-j}$ and $\alpha$, we can rewrite (8.5-4) as

$$\epsilon_{i+1} = \frac{g^{(\beta+1)}(\eta)}{(\beta+1)!}\prod_{j=1}^{n}\left[f'(\xi_j)\right]^{r_j+1}\epsilon_{i+1-j}^{\,r_j+1} \tag{8.5-6}$$

We assume the iteration converges so that all iterates lie on some interval $I$.
Let $K$ be such that

$$\left|\frac{g^{(\beta+1)}(\eta)}{(\beta+1)!}\prod_{j=1}^{n}\left[f'(\xi_j)\right]^{r_j+1}\right| \le K \tag{8.5-7}$$

for all $\eta$ and $\xi_j$. Then

$$|\epsilon_{i+1}| \le K\prod_{j=1}^{n}|\epsilon_{i+1-j}|^{r_j+1} \tag{8.5-8}$$

This equation is reminiscent of (8.3-14). Indeed we can show that

$$d_{i+1} \le d^{\gamma_{i+1}} \qquad i = 1, 2, \ldots; \quad d < 1 \tag{8.5-9}$$

where $d_i = K^{1/\beta}|\epsilon_i|$ and the $\gamma_i$'s satisfy

$$\gamma_{i+1} = \sum_{j=1}^{n}(r_j+1)\,\gamma_{i+1-j} \quad i = n-1, n, \ldots \qquad \gamma_0 = \gamma_1 = \cdots = \gamma_{n-1} = 1 \tag{8.5-10}$$

The indicial equation of (8.5-10) is

$$\rho^n = \sum_{j=1}^{n}(r_j+1)\,\rho^{n-j} \tag{8.5-11}$$

so that $\gamma_i$ is given by some linear combination of the $i$th powers of the zeros of
the polynomial (8.5-11). From these zeros we might expect to be able to indicate the
order of convergence of the iteration as we did in Sec. 8.3, and indeed this can be
done. We shall content ourselves with stating the result that in the case
$r_j = r$ for all $j$, the order of the iteration defined by $x_{i+1} = q(0)$, with $q(y)$ as
in (8.5-2), is given by the only real root of (8.5-11) with magnitude greater
than 1. This root is positive and lies in the interval $(r + 1, r + 2)$ for all $n$. The
derivation of this result can be found in Traub (1964, pp. 62-67).
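The order quoted above is easy to compute numerically. The following is a rough sketch, assuming $r_j = r$ for all $j$ and locating the root of (8.5-11) by bisection on $[r+1, r+2]$; the function name is illustrative and not part of the text.

```python
def order_of_convergence(n, r):
    """Root in [r+1, r+2) of rho**n = (r+1) * sum(rho**(n-j), j = 1..n)."""
    def p(rho):
        return rho ** n - (r + 1) * sum(rho ** (n - j) for j in range(1, n + 1))

    lo, hi = float(r + 1), float(r + 2)
    if p(lo) == 0.0:            # n = 1: the root is exactly r + 1
        return lo
    # For n >= 2, p(lo) < 0 < p(hi), so bisection brackets the root.
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if p(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


print(order_of_convergence(2, 0))   # secant method
print(order_of_convergence(1, 1))   # Newton-Raphson method
print(order_of_convergence(2, 1))   # two points, function and first derivative
```

The first two calls recover the familiar orders of the secant and Newton-Raphson methods; the third gives the order of a two-point formula that uses $f$ and $f'$ at each point.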


8.5-2 Derivative Estimated Iteration Formulas 


Consider the Newton-Raphson method (8.4-14). Let us replace the derivative
term by its approximation found by differentiating a two-point Lagrangian
formula for $f(x)$, the two points being $x_i$ and $x_{i-1}$. We have

$$f(x) \approx \frac{x - x_{i-1}}{x_i - x_{i-1}}\,f(x_i) + \frac{x - x_i}{x_{i-1} - x_i}\,f(x_{i-1}) \tag{8.5-12}$$

so that

$$f'(x_i) \approx \frac{f(x_i) - f(x_{i-1})}{x_i - x_{i-1}} \tag{8.5-13}$$

If we substitute (8.5-13) into the Newton-Raphson method, we get precisely
the secant method, which is not a very interesting result {22}.
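Now consider the method (8.4-7) with $m = 1$, which is

$$x_{i+1} = x_i - \frac{f(x_i)}{f'(x_i)} - \frac{1}{2}\,\frac{[f(x_i)]^2 f''(x_i)}{[f'(x_i)]^3} \tag{8.5-14}$$

Let us replace the second derivative in (8.5-14) by its approximation found
by differentiating a two-point Hermite interpolation formula for $f(x)$, the
two points being $x_i$ and $x_{i-1}$. We have {22}

$$f(x) \approx \frac{1}{(x_i - x_{i-1})^2}\left\{\left[1 - \frac{2(x - x_i)}{x_i - x_{i-1}}\right](x - x_{i-1})^2 f(x_i) + \left[1 - \frac{2(x - x_{i-1})}{x_{i-1} - x_i}\right](x - x_i)^2 f(x_{i-1}) + (x - x_i)(x - x_{i-1})^2 f'(x_i) + (x - x_{i-1})(x - x_i)^2 f'(x_{i-1})\right\} \tag{8.5-15}$$

so that

$$f''(x_i) \approx -\frac{6}{(x_i - x_{i-1})^2}\,[f(x_i) - f(x_{i-1})] + \frac{2}{x_i - x_{i-1}}\,[2f'(x_i) + f'(x_{i-1})] \tag{8.5-16}$$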

Substituting the right-hand side of (8.5-16) into (8.5-14), we obtain an
iteration formula

$$x_{i+1} = x_i - \frac{f(x_i)}{f'(x_i)} - \frac{1}{2}\,\frac{[f(x_i)]^2\,\bar f''(x_i)}{[f'(x_i)]^3} \tag{8.5-17}$$

$$\bar f''(x_i) = -\frac{6}{h_i^2}\,[f(x_i) - f(x_{i-1})] + \frac{2}{h_i}\,[2f'(x_i) + f'(x_{i-1})] \qquad h_i = x_i - x_{i-1} \tag{8.5-18}$$

which, like the Newton-Raphson method, depends upon $f(x)$ and its first
derivative. The order of (8.5-14) is 3. Of interest to us here is the order of the
modified formula (8.5-17).

Let us call the right-hand side of (8.5-14) $F_1(x_i)$ and the right-hand side
of (8.5-17) $F_2(x_i, x_{i-1})$. Then

$$F_1(x_i) - F_2(x_i, x_{i-1}) = -\frac{[f(x_i)]^2}{2[f'(x_i)]^3}\left[\frac{(x_i - x_{i-1})^2}{12}\,f^{iv}(\zeta)\right] \tag{8.5-19}$$

where the term in brackets on the far right is the result of differentiating the
error term of the Hermite interpolation formula twice and then evaluating
the result at $x = x_i$. Now, using (8.4-12), we have

$$\alpha - F_1(x_i) = \tfrac16[f'(\xi_1)]^3 g'''(\eta)\,\epsilon_i^3 \qquad \epsilon_i = \alpha - x_i \tag{8.5-20}$$

Then from (8.5-19) and (8.5-20)

$$\epsilon_{i+1} = \alpha - F_2(x_i, x_{i-1}) = -\frac{[f(x_i)]^2}{2[f'(x_i)]^3}\,\frac{(x_i - x_{i-1})^2}{12}\,f^{iv}(\zeta) + \tfrac16[f'(\xi_1)]^3 g'''(\eta)\,\epsilon_i^3 \tag{8.5-21}$$

Now

$$(x_i - x_{i-1})^2 = (\alpha - x_{i-1} + x_i - \alpha)^2 = (\epsilon_{i-1} - \epsilon_i)^2 \tag{8.5-22}$$

Substituting this into (8.5-21) and using (8.4-11), we have

$$\epsilon_{i+1} = -\frac{[f'(\xi_2)]^2 f^{iv}(\zeta)}{24[f'(x_i)]^3}\,\epsilon_i^2(\epsilon_{i-1} - \epsilon_i)^2 + \tfrac16[f'(\xi_1)]^3 g'''(\eta)\,\epsilon_i^3 \tag{8.5-23}$$

As we near convergence, the errors in (8.5-23) will be monotonically decreasing
in magnitude. The dominant term in (8.5-23) is either $\epsilon_i^3$ or $\epsilon_i^2\epsilon_{i-1}^2$ (why?).
Our object is to show that the $\epsilon_i^2\epsilon_{i-1}^2$ term is in fact dominant. To do this, it is
sufficient to consider the order of a hypothetical formula with an error†

$$\bar\epsilon_{i+1} = -\frac{[f'(\xi_2)]^2 f^{iv}(\zeta)}{24[f'(x_i)]^3}\,\bar\epsilon_i^2\,\bar\epsilon_{i-1}^2 \tag{8.5-24}$$

Proceeding as we have previously, we obtain

$$|\bar\epsilon_{i+1}| \le K|\bar\epsilon_i|^2|\bar\epsilon_{i-1}|^2 \tag{8.5-25}$$

where $K$ is such that

$$\left|\frac{[f'(\xi_2)]^2 f^{iv}(\zeta)}{24[f'(x_i)]^3}\right| \le K \tag{8.5-26}$$

on some interval including $\alpha$. With $d_i = K^{1/3}|\bar\epsilon_i|$ we get, using (8.5-8) to
(8.5-11), that

$$d_{i+1} \le d^{\gamma_{i+1}} \qquad i = 1, 2, \ldots; \quad d < 1 \tag{8.5-27}$$

where

$$\gamma_i = \tfrac12\left[(1 + \sqrt3)^i + (1 - \sqrt3)^i\right] \tag{8.5-28}$$

Therefore, by reasoning similar to that in Sec. 8.3, we conclude, but again do
not prove, that

$$\lim_{i\to\infty}\frac{|\epsilon_{i+1}|}{|\epsilon_i|^{1+\sqrt3}} = \left|\frac{f^{iv}(\alpha)}{24 f'(\alpha)}\right|^{1/\sqrt3} \tag{8.5-29}$$

Therefore, the order of (8.5-17) is $1 + \sqrt3 \approx 2.732$, and the efficiency index is
given by

$$(1 + \sqrt3)^{1/(1+\theta_1)} \tag{8.5-30}$$

which means that (8.5-17) is a distinct improvement over the Newton-Raphson
method, its only relative disadvantage being the need for two
starting values.

† The bars in (8.5-24) and (8.5-25) are to distinguish the errors in this hypothetical formula
from the errors in (8.5-23).

The procedure described above can be generalized in a number of direc- 
tions. One way would be to approximate the (m + 1)st-order derivative in 
the iteration (8.4-7) using an interpolation formula based on the first m 
derivatives. Another generalization would be to use more than two points in 
approximating f”(x;) in (8.5-16) {23}. All such methods of functional itera- 
tion are called derivative estimated iteration formulas. 


Example 8.3 Repeat the calculations of Example 8.1 using the iteration formula (8.5-17)
with $x_1 = \pi$, $x_2 = \pi/2$.
The calculations give

      i     x_i        f''(x_i) from (8.5-18)     f''(x_i)

      2     π/2             -1.15847              -1.00000
      3     1.78659          -.98064               -.97681
      4     1.89414          -.94910               -.94818
      5     1.89549          -.96230               -.94775

This method converges after three iterations, compared with four with the Newton-
Raphson method in Example 8.2. This is as we would expect because the order of (8.5-17)
is 2.732 while that of (8.4-14) is 2. The last two columns in the table give the approximation
to the second derivative given by (8.5-18) and the true value. Note that as the root is
approached, the difference between the approximate and true values of the derivative first
decreases and then increases. This is a phenomenon of numerical differentiation that we
discussed in Chap. 4. When $x_i - x_{i-1}$ is comparatively large, truncation error dominates
and decreases as $x_i - x_{i-1}$ decreases. But when $x_i - x_{i-1}$ gets very small, roundoff
dominates and grows as $x_i - x_{i-1}$ decreases. Note, however, that as $x_i \to \alpha$, the coefficient of
$f''(x_i)$ gets very small; therefore, the loss of significance in $f''(x_i)$ is not important.

† Note that in (8.5-30) we are assuming that the additions and multiplications needed to evaluate
(8.5-17) and (8.5-18) require a negligible amount of time compared with the evaluation of $f(x_i)$
and $f'(x_i)$. This assumption is not unreasonable in general.
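The calculation of Example 8.3 can be sketched as follows, again taking $f(x) = \sin x - x/2$ and writing the second-derivative estimate (8.5-18) as a helper. The function names are illustrative; only the formulas come from the text.

```python
import math


def f(x):
    return math.sin(x) - x / 2


def fp(x):
    return math.cos(x) - 0.5


def second_derivative_estimate(x_prev, x):
    # (8.5-18): second derivative of the two-point Hermite interpolant at x.
    h = x - x_prev
    return -6.0 / h ** 2 * (f(x) - f(x_prev)) + 2.0 / h * (2 * fp(x) + fp(x_prev))


def step(x_prev, x):
    # (8.5-17): the m = 1 formula with f'' replaced by its estimate.
    d2 = second_derivative_estimate(x_prev, x)
    return x - f(x) / fp(x) - 0.5 * f(x) ** 2 * d2 / fp(x) ** 3


xs = [math.pi, math.pi / 2]            # x_1 and x_2
for _ in range(3):
    xs.append(step(xs[-2], xs[-1]))
print([round(x, 5) for x in xs])
```

Only one new evaluation of $f$ and one of $f'$ are needed per step if the values at $x_{i-1}$ are saved; the sketch recomputes them for brevity.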


8.6 FUNCTIONAL ITERATION AT A MULTIPLE ROOT 


All the results of the past three sections depend upon the root $\alpha$ being simple.
In particular, if $\alpha$ is a root of multiplicity $r > 1$, all our derivations based on
inverse interpolation are invalid because the inverse function does not exist
in any neighborhood of $x = \alpha$.

Nevertheless, we can still consider the behavior of such formulas in the 
neighborhood of a multiple root. We consider here the class of methods 
(8.4-7). 

We show first that, independent of $m$, (8.4-7) converges linearly when $\alpha$ is
a root of multiplicity greater than 1. From (8.4-7) we have

$$\epsilon_{i+1} = \alpha - x_{i+1} = \alpha - x_i - \sum_{j=1}^{m+1}\frac{(-1)^j}{j!}\,f_i^{\,j}g_i^{(j)} = \epsilon_i + \sum_{j=0}^{m}Z_j(x_i) \tag{8.6-1}$$

where we define

$$Z_j(x_i) = \frac{(-1)^j}{(j+1)!}\,f_i^{\,j+1}g_i^{(j+1)} \tag{8.6-2}$$

Using the fact that $(g^{(j)})' = f'g^{(j+1)}$, where the prime denotes differentiation
with respect to $x$, a simple calculation shows that

$$(j+1)Z_j = jZ_{j-1} - u_iZ'_{j-1} \qquad u_i = \frac{f_i}{f_i'} \tag{8.6-3}$$

Now consider the Taylor-series expansion of $Z_j(x_i)$ about $\alpha$. Since $Z_j(\alpha) = 0$
(why?), we have

$$Z_j(x_i) = \sum_{k=1}^{\infty}c_{jk}\,\epsilon_i^k \tag{8.6-4}$$

From (8.6-1), (8.6-2), and (8.6-4) {24}

$$\epsilon_{i+1} = \epsilon_i\left(1 + \sum_{j=0}^{m}c_{j1}\right) + O(\epsilon_i^2) \tag{8.6-5}$$

We suppose that $\alpha$ is a root of multiplicity $r$. Then {24}

$$u_i = -\frac1r\,\epsilon_i + O(\epsilon_i^2) \tag{8.6-6}$$

Substituting (8.6-4) into (8.6-3) and using (8.6-6), we get {24}

$$(j+1)c_{j1} = jc_{j-1,1} - \frac1r\,c_{j-1,1} \tag{8.6-7}$$

or

$$c_{j1} = \frac{j - 1/r}{j+1}\,c_{j-1,1} \tag{8.6-8}$$

Therefore,

$$c_{j1} = (-1)^{j+1}\binom{1/r}{j+1} \tag{8.6-9}$$

where $\binom{1/r}{j+1}$ is a binomial coefficient. Substituting (8.6-9) into (8.6-5), we
get

$$\begin{aligned}\epsilon_{i+1} &= \epsilon_i\left[1 + \sum_{j=0}^{m}(-1)^{j+1}\binom{1/r}{j+1}\right] + O(\epsilon_i^2)\\ &= \epsilon_i\left[\sum_{j=0}^{m+1}(-1)^j\binom{1/r}{j}\right] + O(\epsilon_i^2)\\ &= (-1)^{m+1}\binom{1/r - 1}{m+1}\epsilon_i + O(\epsilon_i^2)\end{aligned} \tag{8.6-10}$$

since {24}

$$\sum_{j=0}^{m+1}(-1)^j\binom{1/r}{j} = (-1)^{m+1}\binom{1/r - 1}{m+1} \tag{8.6-11}$$

Finally we have from (8.6-10) that when the iteration converges,

$$\lim_{i\to\infty}\left(\frac{\epsilon_{i+1}}{\epsilon_i}\right) = (-1)^{m+1}\binom{1/r - 1}{m+1} \ne 0 \tag{8.6-12}$$

if $r \ne 1$, which proves that the order of (8.4-7) is 1 if $\alpha$ is not a simple root.
But note that since $\binom{1/r - 1}{m+1}$ has magnitude less than 1, the methods of
this class do converge in the neighborhood of a multiple root.
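The limiting ratio in (8.6-12) is simple to tabulate. The sketch below evaluates the generalized binomial coefficient directly and, as a check, runs the Newton-Raphson method ($m = 0$) on the illustrative function $(x - 1)^2$, which is an assumption chosen here because it has a double root at 1.

```python
def gen_binom(a, k):
    """Generalized binomial coefficient C(a, k) for real a and integer k >= 0."""
    result = 1.0
    for t in range(k):
        result *= (a - t) / (t + 1)
    return result


def linear_rate(m, r):
    # (8.6-12): limiting ratio eps_{i+1}/eps_i at a root of multiplicity r.
    return (-1) ** (m + 1) * gen_binom(1.0 / r - 1.0, m + 1)


# Newton-Raphson (m = 0) on f(x) = (x - 1)**2, a double root at x = 1.
x = 2.0
ratios = []
for _ in range(10):
    x_new = x - (x - 1) ** 2 / (2 * (x - 1))
    ratios.append((1 - x_new) / (1 - x))
    x = x_new
print(linear_rate(0, 2), ratios[-1])
```

Both printed numbers are the ratio 1/2, in agreement with the remark in Example 8.4 that each Newton-Raphson iterate at a double root has about one-half the error of its predecessor.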

If the multiplicity of the root at $\alpha$ is known, the class of methods (8.4-7)
can be modified so that they retain their order of convergence of $m + 2$. In
particular, the Newton-Raphson method (8.4-14) can be modified so that it
still has quadratic convergence for a root of multiplicity $r$ if we write

$$x_{i+1} = x_i - r\,\frac{f(x_i)}{f'(x_i)} \tag{8.6-13}$$

We have

$$\alpha - x_{i+1} = \alpha - x_i + r\,\frac{f(x_i)}{f'(x_i)} \tag{8.6-14}$$

so that

$$(\alpha - x_{i+1})f'(x_i) = G(x_i) \tag{8.6-15}$$

where we have defined

$$G(x) = (\alpha - x)f'(x) + rf(x) \tag{8.6-16}$$

Differentiating, we have

$$G^{(j)}(x) = rf^{(j)}(x) + (\alpha - x)f^{(j+1)}(x) - jf^{(j)}(x) \tag{8.6-17}$$

and, since $\alpha$ is a root of $f(x)$ of multiplicity $r$,

$$G^{(j)}(\alpha) = 0 \quad j = 0, \ldots, r \qquad G^{(r+1)}(\alpha) \ne 0 \tag{8.6-18}$$

Therefore,

$$G(x) = \frac{(x - \alpha)^{r+1}}{(r+1)!}\,G^{(r+1)}(\xi_1) \tag{8.6-19}$$

Since

$$f'(x) = \frac{(x - \alpha)^{r-1}}{(r-1)!}\,f^{(r)}(\xi_2) \tag{8.6-20}$$

because $\alpha$ is a root of $f'(x)$ of multiplicity $r - 1$, we have, using (8.6-19) and
(8.6-20) in (8.6-15),

$$\epsilon_{i+1} = \frac{1}{r(r+1)}\,\frac{G^{(r+1)}(\xi_1)}{f^{(r)}(\xi_2)}\,\epsilon_i^2 \tag{8.6-21}$$

Therefore, the order is 2 since $f^{(r)}(\xi_2)$ is bounded away from zero in a
neighborhood of $x = \alpha$ and $G^{(r+1)}(\alpha) \ne 0$.

Generally, however, the multiplicity of the root is not known a priori.
Thus it would be very desirable to have iteration methods whose order of
convergence is independent of the multiplicity. Such methods can indeed be
found; the key to finding them is to note that $u(x)$ has a zero of multiplicity 1
at $x = \alpha$ no matter what the multiplicity of the zero of $f(x)$ (why?). Therefore,
if, instead of (8.1-1), we consider the equation

$$u(x) = 0 \qquad [u(x) = f(x)/f'(x)] \tag{8.6-22}$$

the roots of this equation are identical with those of (8.1-1) and they are all
simple. We need then only replace $f(x)$ by $u(x)$ in any iteration formula we
have developed thus far to get a formula whose order of convergence is
independent of the multiplicity of the root. For example, the secant method
(8.3-11) and the Newton-Raphson method (8.4-14) become

$$x_{i+1} = \frac{u(x_i)}{u(x_i) - u(x_{i-1})}\,x_{i-1} + \frac{u(x_{i-1})}{u(x_{i-1}) - u(x_i)}\,x_i \tag{8.6-23}$$

and

$$x_{i+1} = x_i - \frac{u(x_i)}{u'(x_i)} \tag{8.6-24}$$

The efficiency of each of these methods is, however, less than that of the
secant and Newton-Raphson methods, respectively, because of the need to
calculate one higher derivative in each case. Furthermore, $u(x)$ will have
poles at those roots of $f'(x)$ which are not roots of $f(x)$ so that $u(x)$ may no
longer be a continuous function.

Example 8.4 Find the positive root of $(\sin x - x/2)^2 = 0$ using (1) the Newton-Raphson
method (8.4-14); (2) the modified Newton-Raphson method (8.6-13) with $r = 2$; (3) the
modified Newton-Raphson method (8.6-24). This equation is, of course, identical with that
of Example 8.1, but here we use the above form for illustrative purposes.

We have

$$f(x) = \left(\sin x - \frac x2\right)^2 \qquad f'(x) = 2\left(\sin x - \frac x2\right)\left(\cos x - \frac12\right)$$

$$f''(x) = 2\left[\left(\cos x - \frac12\right)^2 - \sin x\left(\sin x - \frac x2\right)\right]$$

$$u(x) = \frac{f(x)}{f'(x)} \qquad u'(x) = 1 - \frac{f(x)f''(x)}{[f'(x)]^2}$$

Using $x_1 = \pi/2$ in all cases, we calculate the following results:

                         Method
               1          2          3

      x_2      1.78540    2.00000    1.80175
      x_3      1.84456    1.90100    1.88963
      x_4      1.87083    1.89551    1.89547
      x_5      1.88335    1.89549    1.89549
      x_6      1.88946
      x_7      1.89249
      x_8      1.89399
      x_9      1.89475
      x_10     1.89512
      x_11     1.89531
      x_12     1.89540
      x_13     1.89545
      x_14     1.89547
      x_15     1.89548
      x_16     1.89549

As we expect, methods 2 and 3 converge rapidly and much faster than method 1. Note that
each successive iterate for method 1 has about one-half the error of the previous iterate
{25}.
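The three methods of Example 8.4 can be compared with a short script. This is only a sketch: the helper names are illustrative, and a fixed number of steps is used so that the quotient $f/f'$ is never formed at the root itself, where both vanish.

```python
import math


def s(x):
    return math.sin(x) - x / 2          # f(x) = s(x)**2


def c(x):
    return math.cos(x) - 0.5            # s'(x)


def f(x):
    return s(x) ** 2


def fp(x):
    return 2 * s(x) * c(x)


def fpp(x):
    return 2 * (c(x) ** 2 - math.sin(x) * s(x))


def method1(x):                         # Newton-Raphson (8.4-14)
    return x - f(x) / fp(x)


def method2(x):                         # (8.6-13) with r = 2
    return x - 2 * f(x) / fp(x)


def method3(x):                         # (8.6-24): Newton-Raphson applied to u = f/f'
    u = f(x) / fp(x)
    du = 1 - f(x) * fpp(x) / fp(x) ** 2
    return x - u / du


def iterate(step, x, n):
    out = [x]
    for _ in range(n):
        x = step(x)
        out.append(x)
    return out


for name, step, n in (("1", method1, 15), ("2", method2, 4), ("3", method3, 4)):
    print(name, [round(v, 5) for v in iterate(step, math.pi / 2, n)[1:]])
```

Each printed list starts at $x_2$, so it can be read against the corresponding column of the table.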


8.7 SOME COMPUTATIONAL ASPECTS OF 
FUNCTIONAL ITERATION 


Each method of functional iteration we have discussed has the property that
if the initial approximation is sufficiently close to the root $\alpha$, the method will
converge if $\alpha$ is a simple root. For a one-point iteration method, this is true
even if the root is not simple although, as we have seen, the convergence will
be slower in this case. In general, however, it is not possible to prove that a
multipoint iteration function will always converge to a multiple root even if
the initial approximations are arbitrarily close to the root. For the secant
method it is easy to see that the iteration may not converge to a double
root by considering the case in which $x_1$ and $x_2$ are on opposite sides of the
root and are such that $f(x_1) = f(x_2)$ {27}.




If the initial error is not small enough to guarantee decreasing errors in 
every subsequent iteration, the iteration may (1) diverge or (2) converge 
anyhow because a small initial error is a sufficient but not a necessary 
condition for convergence. The two examples below illustrate these two 
types of behavior. 


Example 8.5 Use the Newton-Raphson method to try to find the root of $xe^{-x} = 0$ starting
with $x_1 = 2$.

The graph of the function is shown in Fig. 8.5. Since $x_{i+1}$ in the Newton-Raphson
method is the intersection of the $x$ axis with the tangent to the curve at $f(x_i)$, the
divergence of the method in this case is clear. On the other hand, if $0 < x_1 < 1$, the iteration
would converge {28}.
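Both behaviors in Example 8.5 can be seen numerically. In the sketch below the Newton-Raphson correction for $f(x) = xe^{-x}$ is simplified by hand to $f/f' = x/(1 - x)$; the starting value 0.5 is an illustrative choice from the interval $0 < x_1 < 1$.

```python
def newton_step(x):
    # f(x) = x*exp(-x) and f'(x) = (1 - x)*exp(-x), so f/f' = x/(1 - x).
    return x - x / (1.0 - x)


def run(x, steps):
    out = [x]
    for _ in range(steps):
        x = newton_step(x)
        out.append(x)
    return out


print(run(2.0, 5))   # iterates move away from the root at 0
print(run(0.5, 8))   # iterates approach 0
```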


Example 8.6 Use the Newton-Raphson method to find the positive root of $x^{20} - 1 = 0$
starting with $x_1 = \frac12$.
For $f(x) = x^{20} - 1$ Eq. (8.4-14) becomes

$$x_{i+1} = x_i - \frac{x_i^{20} - 1}{20x_i^{19}}$$

so that

$$x_2 = \frac12 - \frac{(\frac12)^{20} - 1}{20(\frac12)^{19}} \approx \frac{1}{20}\,2^{19} = 26{,}214.4$$

Thus, because $\frac12$ was not close enough to the root $x = 1$, the first iterate leads to a far worse
result. But

$$x_3 = x_2 - \frac{x_2^{20} - 1}{20x_2^{19}} \approx \frac{19}{20}\,x_2$$

and thus lies closer to $x = 1$ than $x_2$. In fact, it is not hard to see that successive iterates do
in fact converge to 1, albeit very slowly {30}.
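How slowly the iteration of Example 8.6 converges can be counted directly. The stopping rule and tolerance below are assumptions made for this sketch, not part of the example.

```python
def newton_x20(x, tol=1e-12, max_iter=1000):
    """Newton-Raphson for f(x) = x**20 - 1; returns the list of iterates."""
    iterates = [x]
    for _ in range(max_iter):
        x_new = x - (x ** 20 - 1.0) / (20.0 * x ** 19)
        iterates.append(x_new)
        if abs(x_new - x) <= tol * abs(x_new):
            break
        x = x_new
    return iterates


its = newton_x20(0.5)
print(its[1], its[2] / its[1], len(its) - 1)
```

The printed values are $x_2$, the ratio $x_3/x_2$ (close to 19/20), and the number of steps taken. Since each early step shrinks the iterate by only about 5 percent, a couple of hundred steps pass before the quadratic phase begins.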


Figure 8.5 Graph of $f(x) = xe^{-x}$.

The slow convergence in Example 8.6 illustrates one of the computational
problems associated with the use of functional iteration on a digital
computer. It is important that the computer be programmed to recognize
this slow convergence and take appropriate action. For example, if
$|x_{i+1}/x_i|$ is greater than some specified constant $k$, then, instead of the
computed $x_{i+1}$, the program would set $x_{i+1} = \pm Kx_i$, where $K$ is another
specified constant, for example, $K = k/2$, and the sign agrees with that of
$x_{i+1}/x_i$ {30}.

It is clearly also important that divergence be recognized. When it is, a
new starting value should be tried or an always-convergent method such as
the modified method of false position or bisection should be used.
Sometimes successive iterates may oscillate in such a way that it is not clear
whether they are converging or not. In this case, it is often more efficient to
err on the safe side and take action similar to that in the case of divergence.


8.7-1 The $\delta^2$ Process


When an iteration is converging linearly, it is possible to use a technique
similar to Richardson extrapolation (see Sec. 4.2) in order to accelerate the
convergence. If the iteration is converging, we have

$$\alpha - x_{i+1} = C_i(\alpha - x_i) \qquad |C_i| < 1 \tag{8.7-1}$$

where $|C_i| \to C$, the asymptotic error constant. Near convergence $C_i$ will
remain nearly constant, and we can write

$$\alpha - x_{i+1} \approx \bar C(\alpha - x_i) \qquad |\bar C| = C \tag{8.7-2}$$

Writing (8.7-2) with $i$ replaced by $i + 1$ and eliminating $\bar C$, we have

$$\frac{\alpha - x_{i+2}}{\alpha - x_{i+1}} \approx \frac{\alpha - x_{i+1}}{\alpha - x_i} \tag{8.7-3}$$

Solving for $\alpha$, we obtain

$$\alpha \approx \frac{x_ix_{i+2} - x_{i+1}^2}{x_{i+2} - 2x_{i+1} + x_i} = x_{i+2} - \frac{(\Delta x_{i+1})^2}{\Delta^2x_i} \tag{8.7-4}$$

where $\Delta$ is the forward-difference operator of Chap. 3. This extrapolation
procedure is associated with the name of Aitken. Because of the second
difference in (8.7-4) (which could be expressed as a central difference), this
procedure is called Aitken's $\delta^2$ process.

As an example of the use of this technique, we can use the data in the
second column of Example 8.1. Using $x_5$, $x_6$, and $x_7$ in (8.7-4), we obtain as
the new approximation 1.89554, which is a substantial improvement over
any of the values used. Because of the simplicity of this procedure, it should
always be used to accelerate the convergence of linear iterations. Another
application of this procedure is in the determination of the multiplicity of a
multiple root {31}.
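Formula (8.7-4) takes only a few lines to apply. As an illustration (chosen here because its data appear above, not the case worked in the text), the sketch applies it to the linearly converging method-1 iterates $x_7$, $x_8$, $x_9$ tabulated in Example 8.4.

```python
def aitken(x0, x1, x2):
    """Extrapolate from three successive iterates using (8.7-4)."""
    d1 = x2 - x1                 # forward difference of x_{i+1}
    d2 = x2 - 2.0 * x1 + x0      # second forward difference of x_i
    return x2 - d1 * d1 / d2


# x_7, x_8, x_9 of method 1 in Example 8.4.
print(aitken(1.89249, 1.89399, 1.89475))
```

The extrapolated value lies much closer to the root 1.89549 than any of the three iterates used, although the five-figure rounding of the inputs limits how many digits can be gained.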

For iterations whose order of convergence is greater than 1, (8.7-4)
should not be used. For such iterations it is sometimes possible to speed up
the convergence using so-called self-acceleration procedures. A discussion of
these is beyond our scope here; see Traub (1964, pp. 185-187).


8.8 SYSTEMS OF NONLINEAR EQUATIONS 


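The system of nonlinear equations

$$f_j(x^{(1)}, x^{(2)}, \ldots, x^{(n)}) = 0 \qquad j = 1, \ldots, n \tag{8.8-1}$$

can be rewritten in vector notation as†

$$\mathbf f(\mathbf x) = \mathbf 0 \tag{8.8-2}$$

where

$$\mathbf f = [f_1 \;\; f_2 \;\; \cdots \;\; f_n] \qquad \mathbf x = [x^{(1)} \;\; x^{(2)} \;\; \cdots \;\; x^{(n)}] \tag{8.8-3}$$

Using the form (8.8-2), we can then derive functional iteration methods as in
Sec. 8.4. The general form of the functional iteration equation for stationary
iterations is

$$\mathbf x_{i+1} = \mathbf F(\mathbf x_i, \mathbf x_{i-1}, \ldots, \mathbf x_{i-p}) \tag{8.8-4}$$

where

$$\mathbf F = [F_1 \;\; F_2 \;\; \cdots \;\; F_n] \tag{8.8-5}$$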


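We suppose that in some neighborhood of the solution $\boldsymbol\alpha = (\alpha_1, \ldots, \alpha_n)$ of
(8.8-2) the vector function $\mathbf f$ has an inverse

$$\mathbf g = [g_1 \;\; g_2 \;\; \cdots \;\; g_n] \tag{8.8-6}$$

Then using the notation $\mathbf y = (y^{(1)}, \ldots, y^{(n)})$ for the point inverse to $\mathbf x$, we
expand $\mathbf g(\mathbf y)$ in a Taylor series about $\mathbf y_i$ [see Apostol (1957), pp. 123-124]:

$$\mathbf x = \mathbf g(\mathbf y) = \mathbf g(\mathbf y_i) + \sum_{j=1}^{m+1}\frac{1}{j!}\,d^j\mathbf g(\mathbf y_i; \mathbf y - \mathbf y_i) + \frac{1}{(m+2)!}\,d^{m+2}\mathbf g(\boldsymbol\xi; \mathbf y - \mathbf y_i) \tag{8.8-7}$$

where $\boldsymbol\xi$ lies on the line segment joining $\mathbf y$ and $\mathbf y_i$ and the $j$th-order differential
is defined by

$$d^jh(\mathbf x; \mathbf s) = \sum_{i_1=1}^{n}\sum_{i_2=1}^{n}\cdots\sum_{i_j=1}^{n}D_{i_1, i_2, \ldots, i_j}h(\mathbf x)\,s^{(i_1)}s^{(i_2)}\cdots s^{(i_j)} \tag{8.8-8}$$

where $D_{i_1, \ldots, i_j}h(\mathbf x)$ is the partial derivative of $h$ with respect to the variables
$x^{(i_1)}, \ldots, x^{(i_j)}$ at the point $\mathbf x$ and $s^{(1)}, \ldots, s^{(n)}$ are components of $\mathbf s$. We assume, of
course, that all needed partial derivatives exist. Then, since $\boldsymbol\alpha = \mathbf g(\mathbf 0)$, setting
$\mathbf y = \mathbf 0$ in (8.8-7) and dropping the remainder term, we get the equation
analogous to (8.4-7). For example, with $m = 0$ we get the $n$-dimensional
analog of the Newton-Raphson method

$$\mathbf x_{i+1} = \mathbf x_i + d\mathbf g(\mathbf y_i; -\mathbf y_i) = \mathbf x_i - \sum_{j=1}^{n}\frac{\partial\mathbf g(\mathbf y_i)}{\partial y^{(j)}}\,y_i^{(j)} \tag{8.8-9}$$

which can also be written as {32}

$$\mathbf x_{i+1} = \mathbf x_i - J^{-1}(\mathbf x_i)\mathbf f(\mathbf x_i) = \mathbf x_i - J_i^{-1}\mathbf f_i \tag{8.8-10}$$

where the Jacobian

$$J_i = \left(\frac{\partial f_j}{\partial x^{(k)}}\right)_{\mathbf x = \mathbf x_i} \tag{8.8-11}$$

In particular, for $n = 2$, we use the explicit inverse of $J_i$ to get

$$\mathbf x_{i+1} = \mathbf x_i - \frac1D\begin{bmatrix}\dfrac{\partial f_2}{\partial x^{(2)}} & -\dfrac{\partial f_1}{\partial x^{(2)}}\\[2mm] -\dfrac{\partial f_2}{\partial x^{(1)}} & \dfrac{\partial f_1}{\partial x^{(1)}}\end{bmatrix}\begin{bmatrix}f_1\\ f_2\end{bmatrix} \tag{8.8-12}$$

where

$$D = \begin{vmatrix}\dfrac{\partial f_1}{\partial x^{(1)}} & \dfrac{\partial f_1}{\partial x^{(2)}}\\[2mm] \dfrac{\partial f_2}{\partial x^{(1)}} & \dfrac{\partial f_2}{\partial x^{(2)}}\end{vmatrix} \tag{8.8-13}$$

with all quantities evaluated at $\mathbf x = \mathbf x_i$.

† For convenience and because it causes no confusion, we shall not distinguish between row
and column vectors in this section.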

As in the one-dimensional case, the order of convergence is 2 {33}. Here, 
however, we must note that the problems in the use of functional iteration 
methods with systems of equations are quite different from those for single 
equations. For single equations we have noted that good a priori informa- 
tion on the location of the root is often available; when it is not, we can use 
an always-convergent method to obtain a good approximation to the root. 
In this case we were therefore mainly interested in the efficiency of methods 
and comparatively little worried about whether or not a method converged. 
But with systems of equations convergence itself is such a serious problem 
that usually we shall be satisfied with any order of convergence if only the 
method will converge. Any reader who doubts this should try using the 
Newton-Raphson method to solve two simultaneous polynomial equations 
of degree 2 in two variables. Often, if the initial approximation is not quite 
close to the solution, the iteration will not converge {35}. The form of
the error in (8.8-12), which we leave to a problem {33}, would make
the reason for this clearer.

Before we tackle the problem of finding a good initial approximation,
we shall indicate some modifications and generalizations of Newton's
method. All these methods will be based on the iteration formula

$$\mathbf x_{i+1} = \mathbf x_i - t_iH_i\mathbf f_i \tag{8.8-14}$$

where $H_i$ is an $n \times n$ matrix determined by the particular method used and $t_i$ is
a scaling factor, which is usually taken to be either unity or such that

$$\|\mathbf f_{i+1}\| < \|\mathbf f_i\| \tag{8.8-15}$$

for some convenient norm. In the latter case, $t_i$ is usually determined by a
one-dimensional search along the line $\mathbf x_i - tH_i\mathbf f_i$. In some methods, we try
to find $t_i$ such that $\|\mathbf f_{i+1}\| = \min_{t>0}\|\mathbf f(\mathbf x_i - tH_i\mathbf f_i)\|$ or approximately so.
Since this may be too time-consuming, we may search only until we find a $t$
such that (8.8-15) holds.

Newton's method (8.8-10) is of the form (8.8-14) with $H_i = J_i^{-1}$ and
$t_i = 1$. With this $H_i$ but with $t_i$ chosen so that (8.8-15) holds, we obtain the
so-called damped Newton method. In another variation of Newton's
method, the modified Newton method, we do not recompute the value of the
Jacobian $J(\mathbf x)$ at each iteration point $\mathbf x_i$. Instead we use the same value of
the Jacobian for a fixed number $m$ of iterations, and if convergence is not
reached within $m$ iterations, we compute a new value.


Example 8.7 Solve the system of equations

$$f_1(x, y) = x^2 - y - 1 = 0$$
$$f_2(x, y) = (x - 2)^2 + (y - .5)^2 - 1 = 0$$

using (8.8-12).
This system has two roots

      r_1 = (1.54634 28833 2, 1.39117 63127 9)
      r_2 = (1.06734 60858 1, .13922 76668 87)

The Jacobian matrix of this system is

$$J = \begin{bmatrix}2x & -1\\ 2x - 4 & 2y - 1\end{bmatrix}$$

and (8.8-12) becomes

$$\begin{bmatrix}x_{i+1}\\ y_{i+1}\end{bmatrix} = \begin{bmatrix}x_i\\ y_i\end{bmatrix} - \frac{1}{4x_iy_i - 4}\begin{bmatrix}2y_i - 1 & 1\\ -2x_i + 4 & 2x_i\end{bmatrix}\begin{bmatrix}x_i^2 - y_i - 1\\ (x_i - 2)^2 + (y_i - .5)^2 - 1\end{bmatrix} \tag{8.8-16}$$

In the following table, we give various starting values, $(x_1, y_1)$, the root to which (8.8-16)
converged with that starting value and the number $N$ of iterations required to achieve
accuracy of 12 significant figures:


(x;,y1) Root N 


(0, 0) ri 7 
(.1, 2) ry 25 
(1, 0) ri 4 
(15,1) 4, 5 
(2, 2) ry 5 
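Iteration (8.8-16) is short enough to code directly. The sketch below is one possible arrangement; the stopping rule (an absolute change of at most `tol` in both components) is an assumption, so its iteration counts need not match the column $N$ exactly.

```python
def f(x, y):
    return (x * x - y - 1.0, (x - 2.0) ** 2 + (y - 0.5) ** 2 - 1.0)


def newton_step(x, y):
    # (8.8-16): explicit inverse of the Jacobian [[2x, -1], [2x - 4, 2y - 1]].
    f1, f2 = f(x, y)
    det = 4.0 * x * y - 4.0
    dx = ((2.0 * y - 1.0) * f1 + f2) / det
    dy = ((-2.0 * x + 4.0) * f1 + 2.0 * x * f2) / det
    return x - dx, y - dy


def newton(x, y, tol=1e-12, max_iter=100):
    """Return (x, y, number of steps) once successive iterates agree to tol."""
    for n in range(1, max_iter + 1):
        x_new, y_new = newton_step(x, y)
        if max(abs(x_new - x), abs(y_new - y)) <= tol:
            return x_new, y_new, n
        x, y = x_new, y_new
    raise RuntimeError("no convergence within max_iter steps")


print(newton(1.0, 0.0))
print(newton(1.5, 1.0))
```

The determinant $4xy - 4$ vanishes on the curve $xy = 1$, so a production version would need to guard against a singular or nearly singular Jacobian.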


We shall now show that the method of steepest descent can also be
written in form (8.8-14). This method is used to solve a related minimum
problem. Let

$$F(\mathbf x) = \mathbf f^T(\mathbf x)\mathbf f(\mathbf x) = \sum_{j=1}^{n}[f_j(x^{(1)}, \ldots, x^{(n)})]^2 \tag{8.8-17}$$

The function $F$ takes on its absolute minimum, 0, at a solution of (8.8-1).
Therefore, if we can find an absolute minimum for $F$, we shall have solved
(8.8-1). If we define the gradient vector of a function $g(\mathbf x)$ by

$$\nabla g(\mathbf x) = \left[\frac{\partial g}{\partial x^{(1)}} \;\; \cdots \;\; \frac{\partial g}{\partial x^{(n)}}\right] \tag{8.8-18}$$

then at the point $\mathbf x_i$, $g(\mathbf x)$ decreases most rapidly in the direction $-\nabla g(\mathbf x_i)$.
To minimize $g(\mathbf x)$, the classical method of steepest descent searches along
the direction $-\nabla g(\mathbf x_i)$ to find the point $\mathbf x_{i+1}$ which minimizes $g(\mathbf x)$. However,
to find the minimizing point $\mathbf x_{i+1}$ may require too much work as measured
by the number of function evaluations. In many applications, it suffices to
find a point $\mathbf x_{i+1}$ at which the magnitude of $g(\mathbf x)$ is less than at $\mathbf x_i$. This
is always possible if $\nabla g(\mathbf x_i) \ne \mathbf 0$, that is, if $\mathbf x_i$ is not a stationary point of
$g(\mathbf x)$. In this case there is always an interval $I_i = (0, T_i)$ such that if
$t \in I_i$, $|g(\mathbf x_i - t\nabla g(\mathbf x_i))| < |g(\mathbf x_i)|$. Here $T_i$ depends on the function $g$ as well
as on the point $\mathbf x_i$ and is not usually known in advance. We also note
that, by continuity, if $\mathbf d$ is a direction close to that of $\nabla g(\mathbf x_i)$, then there
exists an interval $I(\mathbf d)$ such that if $t \in I(\mathbf d)$, $|g(\mathbf x_i - t\mathbf d)| < |g(\mathbf x_i)|$, and as $\mathbf d$
approaches $\nabla g(\mathbf x_i)$, $I(\mathbf d)$ approaches $I_i$.

If we now identify the function $F(\mathbf x)$ with $g(\mathbf x)$, we have that $\nabla F(\mathbf x) =
2J^T(\mathbf x)\mathbf f(\mathbf x)$ {36}. Hence, with the choice $H_i = J_i^T$, $0 < t_i < 2T_i$, (8.8-14)
becomes the method of steepest descent and (8.8-15) is satisfied.

One problem with the method of steepest descent, which is always mentioned
as a flaw of this method, is that we may converge to a local minimum
$\mathbf x^*$ of $F(\mathbf x)$, where $\mathbf f(\mathbf x^*) \ne \mathbf 0$. In response to this, we point out that when this
occurs, $J^T(\mathbf x^*)\mathbf f(\mathbf x^*) = \mathbf 0$, and since $\mathbf f(\mathbf x^*) \ne \mathbf 0$, this implies that $J(\mathbf x^*)$ is singular.
But singularity of the Jacobian causes other methods, e.g., Newton's
method, to break down too, so that this weakness is not special to steepest
descent.
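A minimal sketch of steepest descent in the form (8.8-14), with $H_i = J_i^T$, follows. It uses the system of Example 8.7 for illustration and picks $t_i$ by repeated halving until the sum of squares decreases; the halving rule is an assumption standing in for the one-dimensional search.

```python
def f(x, y):
    return (x * x - y - 1.0, (x - 2.0) ** 2 + (y - 0.5) ** 2 - 1.0)


def sum_sq(x, y):
    f1, f2 = f(x, y)
    return f1 * f1 + f2 * f2            # F(x) of (8.8-17)


def jt_f(x, y):
    # J^T f for J = [[2x, -1], [2x - 4, 2y - 1]]; the gradient of F is 2 J^T f.
    f1, f2 = f(x, y)
    return (2.0 * x * f1 + (2.0 * x - 4.0) * f2, -f1 + (2.0 * y - 1.0) * f2)


def descent_step(x, y):
    gx, gy = jt_f(x, y)
    if gx == 0.0 and gy == 0.0:
        return x, y                      # stationary point of F
    t, base = 1.0, sum_sq(x, y)
    for _ in range(60):
        # Halve t until F decreases, so that (8.8-15) holds in the Euclidean norm.
        if sum_sq(x - t * gx, y - t * gy) < base:
            return x - t * gx, y - t * gy
        t *= 0.5
    return x, y                          # no decrease found; leave the point unchanged


point = (0.0, 0.0)
for _ in range(50):
    point = descent_step(*point)
print(point, sum_sq(*point))
```

The sum of squares never increases from one step to the next, but progress near a solution is typically slow, which is the practical weakness discussed in the text.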

Whereas Newton’s method converges quadratically in a neighborhood 
of the solution and steepest descent has only linear convergence, neverthe- 
less the former method requires a good initial approximation to the solution 
if it is to converge at all, while the latter will converge from any initial 
approximation (if the Jacobian is nonsingular everywhere). Since, in prac- 
tice, the convergence of steepest descent is very slow, it can be used initially, 
and when we get close to the solution, we can switch to Newton’s method. 
The catch is, of course, to decide when we are sufficiently close to ensure that 
Newton’s method will converge. 

Another way to combine the methods of Newton and steepest descent is
via the Levenberg-Marquardt algorithm, in which we set

$$H_i = (J_i^TJ_i + \lambda_iI)^{-1}J_i^T \qquad t_i = 1.0 \qquad \lambda_i \ge 0 \tag{8.8-19}$$

For $\lambda_i = 0$, this yields Newton's method, whereas as $\lambda_i$ increases, the step
length decreases {37} and the direction tends to that of steepest descent.
Since $H_i = (1/\lambda_i)(I + \lambda_i^{-1}J_i^TJ_i)^{-1}J_i^T$, so that $H_i \to (1/\lambda_i)J_i^T$ as $\lambda_i \to \infty$, it
follows from our remarks above that for $\lambda_i$ sufficiently large, (8.8-15) will hold.
The strategy in the use of this algorithm is thus to take $\lambda_i$ initially large and
to reduce it as the solution is approached. Note that for $\lambda_i > 0$ the inverse
matrix always exists {37}. Thus, in effect, we use the method of steepest
descent to get a good initial approximation to Newton's method.
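One way the strategy above might be coded for a two-equation system is sketched here, again using the system of Example 8.7. The rule for adjusting $\lambda_i$ (divide by 10 after a step that reduces the residual, multiply by 10 otherwise) is a common heuristic assumed for this sketch; the text specifies only that $\lambda_i$ starts large and is reduced near the solution.

```python
def f(x, y):
    return (x * x - y - 1.0, (x - 2.0) ** 2 + (y - 0.5) ** 2 - 1.0)


def jac(x, y):
    return ((2.0 * x, -1.0), (2.0 * x - 4.0, 2.0 * y - 1.0))


def norm2(x, y):
    return sum(v * v for v in f(x, y))


def lm_step(x, y, lam):
    """Return x - H f with H = (J^T J + lam I)^{-1} J^T, as in (8.8-19)."""
    (a, b), (c, d) = jac(x, y)
    f1, f2 = f(x, y)
    # Symmetric matrix A = J^T J + lam I and right-hand side g = J^T f.
    a11, a12, a22 = a * a + c * c + lam, a * b + c * d, b * b + d * d + lam
    g1, g2 = a * f1 + c * f2, b * f1 + d * f2
    det = a11 * a22 - a12 * a12
    return x - (a22 * g1 - a12 * g2) / det, y - (a11 * g2 - a12 * g1) / det


def levenberg_marquardt(x, y, lam=1.0, tol=1e-12, max_iter=200):
    for _ in range(max_iter):
        if norm2(x, y) <= tol * tol:
            break
        xn, yn = lm_step(x, y, lam)
        if norm2(xn, yn) < norm2(x, y):
            x, y, lam = xn, yn, lam / 10.0    # accept; move toward Newton's method
        else:
            lam *= 10.0                       # reject; move toward steepest descent
    return x, y


print(levenberg_marquardt(1.0, 0.0))
```

Because a trial point is accepted only when it lowers the residual, the accepted iterates satisfy (8.8-15) by construction.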

An alternate way to get a good initial approximation to the solution of
(8.8-1) is by Davidenko's method. Let $\mathbf g(\mathbf x, t)$ be a function of $\mathbf x$ and $t$ such that
$\mathbf g(\mathbf x, 0) = \mathbf f(\mathbf x)$ and $\mathbf g(\mathbf x, 1) = \mathbf 0$ has a known solution. Then, starting with
$t_0 = 1$, we choose a sequence $t_i$

$$t_0 = 1 > t_1 > t_2 > \cdots > t_N = 0$$

and solve in succession

$$\mathbf g(\mathbf x, t_i) = \mathbf 0 \tag{8.8-20}$$

for $\mathbf x_i$ using as initial approximation either $\mathbf x_{i-1}$ or an extrapolation from
several previous values of $\mathbf x_i$. It follows then that the solution of $\mathbf g(\mathbf x, t_N) = \mathbf 0$
is also a solution of $\mathbf f(\mathbf x) = \mathbf 0$. If $t_i$ is close to $t_{i-1}$, the initial approximation
should be good enough to ensure that (8.8-20) can readily be solved by
Newton's method or one of the variations discussed below. The problem of
finding an appropriate function $\mathbf g(\mathbf x, t)$ is quite simple. In fact, if $\mathbf x_1$ is an
initial estimate of the solution of (8.8-2), then two possibilities for $\mathbf g(\mathbf x, t)$ are

$$\mathbf g(\mathbf x, t) = \mathbf f(\mathbf x) - t\mathbf f(\mathbf x_1) \tag{8.8-21}$$

and

$$\mathbf g(\mathbf x, t) = (\mathbf x - \mathbf x_1)t + (1 - t)\mathbf f(\mathbf x) \tag{8.8-22}$$

Davidenko's method has connections with differential equations, which will
be explored in a problem {38}.


Example 8.8 Solve the system of equations discussed in Example 8.7 by Davidenko's
method using (8.8-21) with the sequence of $\{t_i\}$ given by {.95, .9, .8, .6, .3, 0}.

For each value of $t_i$, $k$ iterations of Newton's method were performed with $k = 2, 3,
4$. For $t_1 = .95$, the initial value $\mathbf x_1(t_1)$ for the Newton iterations was the corresponding
initial value used in Example 8.7. For $t_2 = .9$, the initial value was the result after $k$
applications of the Newton iterations for $t_1 = .95$, that is, $\mathbf x_1(.9) = \mathbf x_{k+1}(.95)$. For the
remaining $t_i$, the initial values $\mathbf x_1(t_i)$ were computed by the extrapolation formula

$$\mathbf x_1(t_i) = \mathbf x_{k+1}(t_{i-1}) + [\mathbf x_{k+1}(t_{i-1}) - \mathbf x_{k+1}(t_{i-2})] \tag{8.8-23}$$

where $\mathbf x_{k+1}(t_{i-1})$ is the solution for $t_{i-1}$ after $k$ iterations.

In the table below, we give the initial values $(x_1, y_1)$, the root to which the final result
appears to converge, and the number of correct significant figures in each component for
each $k$.
(x1, ¥1) Root k=2 k =3 k=4 
(0, 0) ry (8, 6) (12, 12) (12, 12) 
(.1, 2) ry (3, 1) (4, 3) (7, 6) 
(1, 0) ry (8, 6) (12, 12) (12, 12) 
(1.5, 1) ry (6, 5) (10, 10) (10, 10) 
(2, 2) ry (5, 5) (9, 4) (12, 12) 
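The procedure of Example 8.8 can be sketched in a few lines. The code follows the description above (the imbedding (8.8-21), $k$ Newton steps per value of $t_i$, and the extrapolation (8.8-23) from the third value onward); the function names and data layout are illustrative.

```python
def f(x, y):
    return (x * x - y - 1.0, (x - 2.0) ** 2 + (y - 0.5) ** 2 - 1.0)


def newton_on_g(x, y, target, k):
    """k Newton steps on g(x, t) = f(x) - t f(x_1), where target = t f(x_1)."""
    for _ in range(k):
        g1, g2 = f(x, y)[0] - target[0], f(x, y)[1] - target[1]
        det = 4.0 * x * y - 4.0          # determinant of [[2x, -1], [2x - 4, 2y - 1]]
        x, y = (x - ((2.0 * y - 1.0) * g1 + g2) / det,
                y - ((4.0 - 2.0 * x) * g1 + 2.0 * x * g2) / det)
    return x, y


def davidenko(x1, y1, k, ts=(0.95, 0.9, 0.8, 0.6, 0.3, 0.0)):
    f0 = f(x1, y1)
    sols = []
    for i, t in enumerate(ts):
        if i == 0:
            start = (x1, y1)
        elif i == 1:
            start = sols[-1]
        else:
            # (8.8-23): extrapolate from the two previous solutions.
            start = (2.0 * sols[-1][0] - sols[-2][0],
                     2.0 * sols[-1][1] - sols[-2][1])
        sols.append(newton_on_g(start[0], start[1], (t * f0[0], t * f0[1]), k))
    return sols[-1]


print(davidenko(2.0, 2.0, 4))
```

The Jacobian of $\mathbf g$ with respect to $\mathbf x$ is the same as that of $\mathbf f$ for the imbedding (8.8-21), which is why the Newton step reuses the matrix of (8.8-16).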


Newton's method and its modifications described above, the method of
steepest descent and the Levenberg-Marquardt algorithm, all require the
computation of the Jacobian $J_i$. However, even for moderate values of $n$, the
differentiations and programming effort needed to evaluate $J_i$ can become
quite tedious, and for large values of $n$, even when a system for analytic
differentiation is available, the computing time and storage requirements
may become excessive. Thus, in practice, we usually approximate $J_i$ by
another matrix $B_i$, which is more easily computed. One way of defining $B_i$ is
by replacing the partial derivatives appearing in $J_i$ by finite-difference
approximations of the form

$$\left.\frac{\partial f_j}{\partial x^{(k)}}\right|_{\mathbf x = \mathbf x_i} \approx \frac{f_j(\mathbf x_i + h_i\mathbf e_k) - f_j(\mathbf x_i)}{h_i} \equiv b_{jk}^{(i)} \tag{8.8-24}$$

where $\mathbf e_k$ is the $k$th column of $I_n$, the identity matrix of order $n$, and $h_i$
is small in absolute value. How small $h_i$ should be is a problem since if $h_i$
is too large, there will be a large truncation error, while if $h_i$ is too small,
a substantial roundoff error can occur. In any event, if we wish to retain the
quadratic convergence of Newton's method, it has been shown that $h_i$ must
be chosen so that

$$|h_i| \le C\|\mathbf f_i\| \tag{8.8-25}$$

for some positive constant $C$.

While this discretization of the Jacobian may save on storage and computing
time, it is still very expensive since it involves $n^2$ functional evaluations.
One way of skimping on this is to retain the old Jacobian (or its
approximation) over a series of iterations, updating it, for instance, after
each $m$ steps for some suitable $m$. This reduces the rate of convergence but
should be more economical in the long run.

A better scheme is to generate an approximation $B_{i+1}$ to $J_{i+1}$ at step
$i + 1$ by updating the current approximation $B_i$ to the Jacobian $J_i$ without
further function evaluations. This possibility introduces a great economy in
the computation. There are a variety of ways to do this, as we shall presently
see. If we let $\mathbf{q}$ be an arbitrary vector, then

$$f_j(\mathbf{x}_i + \mathbf{q}) - f_j(\mathbf{x}_i) \approx \sum_{l=1}^{n} \left.\frac{\partial f_j}{\partial x_l}\right|_{\mathbf{x}=\mathbf{x}_i} q_l \qquad j = 1, \ldots, n \tag{8.8-26}$$

Hence, if we can find a matrix $B$ such that

$$\mathbf{f}(\mathbf{x}_i + \mathbf{q}) - \mathbf{f}(\mathbf{x}_i) = B\mathbf{q} \tag{8.8-27}$$

then $B$ will be a candidate for an approximation to the Jacobian $J_i$. Now,
one can satisfy (8.8-27) simultaneously for a collection of vectors $\mathbf{q}$, provided
they are linearly independent. For example, if $\mathbf{q}_k = h_i\mathbf{e}_k$, $k = 1, \ldots, n$, then
$B = B_i = [b_{jk}^{(i)}]$, with $b_{jk}^{(i)}$ given by (8.8-24), satisfies (8.8-27).

Now, at step $i + 1$ we have already computed $\mathbf{x}_{i+1}$ and $\mathbf{f}_{i+1} = \mathbf{f}(\mathbf{x}_{i+1})$.
Hence, a natural choice for $\mathbf{q}$ is $\mathbf{x}_{i+1} - \mathbf{x}_i$ since this does not require any
further function evaluations. For this choice of $\mathbf{q}$, our previous approximation
$B_i$ will, in general, not satisfy (8.8-27). However, there exist many matrices
which do satisfy (8.8-27), and from this class we select our new approximation
$B_{i+1}$. Our first requirement, then, is that

$$B_{i+1}\mathbf{q}_i = \mathbf{v}_i \tag{8.8-28}$$

where

$$\mathbf{v}_i = \mathbf{f}_{i+1} - \mathbf{f}_i \qquad \mathbf{q}_i = \mathbf{x}_{i+1} - \mathbf{x}_i \tag{8.8-29}$$

Since the $n$ equations (8.8-28) are not sufficient to specify the $n^2$ elements of
$B_{i+1}$, various methods can be defined by imposing additional conditions.
The generalized secant procedure {39} chooses $B_{i+1}$ to satisfy the $n - 1$
additional sets of $n$ equations

$$B_{i+1}\mathbf{q}_j = \mathbf{v}_j \qquad j = i-1, i-2, \ldots, i-n+1 \tag{8.8-30}$$

but this can only be applied for $i > n$. A modified secant procedure requires
that

$$B_{i+1}\mathbf{q}_j = \mathbf{v}_j \tag{8.8-31}$$

for $n - 1$ additional vectors $\mathbf{q}_j$ previously generated, $j < i$, where the $\mathbf{q}_j$
satisfy some criteria of linear independence.
The most sophisticated procedure is that of Broyden, which defines $B_{i+1}$
by the equation

$$B_{i+1} = B_i - \frac{(B_i\mathbf{q}_i - \mathbf{v}_i)\mathbf{q}_i^T}{\mathbf{q}_i^T\mathbf{q}_i} \tag{8.8-32}$$

for which (8.8-28) is also satisfied. Since the rank of $B_{i+1} - B_i$ is 1, this is
called a rank-one updating procedure. This formula has the advantage that if
$A_i = B_i^{-1}$, then {42}

$$A_{i+1} = A_i - \frac{(A_i\mathbf{v}_i - \mathbf{q}_i)\mathbf{q}_i^T A_i}{\mathbf{q}_i^T A_i\mathbf{v}_i} \tag{8.8-33}$$

Thus, in those methods which require the inverse of the Jacobian, $J_i^{-1}$, we
can avoid the computation of the inverse matrix involving $O(n^3)$ operations
by working with the $A_i$ instead. For example, corresponding to Newton’s
method, we have a quasi-Newton method given by the iteration

$$\mathbf{x}_{i+1} = \mathbf{x}_i - A_i\mathbf{f}(\mathbf{x}_i) \tag{8.8-34}$$

with $A_1 = J_1^{-1}$ and $A_i$ given by the updating formulas (8.8-29) and (8.8-33).


Example 8.9 Solve the system of equations discussed in Example 8.7 using the
matrix-update method with the same set of initial values.

The results corresponding to those in Example 8.7 for the number of iterations $N$
needed to achieve accuracy of 12 significant figures are:

(x_1, y_1)   N
(0, 0)       11
(.1, 2)      14
(1, 0)       7
(1.5, 1)     8
(2, 2)       9


As an example of the details of the calculation, we shall follow two steps in the
computation starting with $x_1 = 0$, $y_1 = 0$. We have that

$$f_1(x_1, y_1) = -1 \qquad f_2(x_1, y_1) = 3.25 \qquad A_1 = J_1^{-1} = \begin{bmatrix} .25 & -.25 \\ -1 & 0 \end{bmatrix}$$

By Newton’s method, $x_2 = 1.0625$, $y_2 = -1$, and $f_1(x_2, y_2) = 1.12890625$,
$f_2(x_2, y_2) = 2.12890625$. From these values, we use (8.8-29) and (8.8-33) to compute

$$\mathbf{q}_1 = \begin{bmatrix} 1.0625 \\ -1 \end{bmatrix} \qquad \mathbf{v}_1 = \begin{bmatrix} 2.12890625 \\ -1.12109375 \end{bmatrix}$$

$$A_2 = \begin{bmatrix} .3557441\cdots & -.2721932\cdots \\ -.5224991\cdots & -.1002162\cdots \end{bmatrix}$$

For comparison purposes, we note that the inverse of the true Jacobian is

$$J_2^{-1} = \begin{bmatrix} .3636363\cdots & -.1212121\cdots \\ -.2272727\cdots & -.2575757\cdots \end{bmatrix}$$

Using the pure Newton method with $J_2^{-1}$, we find that $x_3 = .9100378787\cdots$,
$y_3 = -.1950757575\cdots$ while for the update method, $x_3 = 1.240372062\cdots$,
$y_3 = -.1967964670\cdots$. We see that the difference between the values of $x_3$ is substantial
while that between the $y_3$ is not.

In general, the results using the matrix-update method are worse than the results
using Newton’s method, except for the unusual case of $x_1 = .1$, $y_1 = 2$, in which case the
matrix-update method required fewer iterations for convergence. For two equations the
update algorithm is much more complicated than Newton’s method, and, in general,
matrix-update methods are of interest only for moderate and large values of $n$.
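The first update step of Example 8.9 can be retraced with a short sketch of the inverse update (8.8-33) and the iteration (8.8-34). The system of Example 8.7 is not restated in this section; the two functions used below are an assumption, inferred from the function values and the matrix $A_1$ quoted in the example, and all names are hypothetical.

```python
def f(x, y):
    # System inferred from the values quoted in the example (an assumption).
    return [x * x - y - 1.0, (x - 2.0) ** 2 + (y - 0.5) ** 2 - 1.0]


def broyden_inverse_update(A, q, v):
    """Return A - (A v - q) q^T A / (q^T A v), as in (8.8-33)."""
    n = len(q)
    Av = [sum(A[r][c] * v[c] for c in range(n)) for r in range(n)]
    qTA = [sum(q[r] * A[r][c] for r in range(n)) for c in range(n)]
    denom = sum(qTA[c] * v[c] for c in range(n))
    u = [Av[r] - q[r] for r in range(n)]
    return [[A[r][c] - u[r] * qTA[c] / denom for c in range(n)] for r in range(n)]


x1 = [0.0, 0.0]
A1 = [[0.25, -0.25], [-1.0, 0.0]]          # inverse Jacobian at (0, 0)
f1 = f(*x1)
x2 = [x1[r] - sum(A1[r][c] * f1[c] for c in range(2)) for r in range(2)]
f2 = f(*x2)
q1 = [x2[r] - x1[r] for r in range(2)]     # (8.8-29)
v1 = [f2[r] - f1[r] for r in range(2)]
A2 = broyden_inverse_update(A1, q1, v1)
x3 = [x2[r] - sum(A2[r][c] * f2[c] for c in range(2)) for r in range(2)]
```

Under the stated assumption about the system, `x2`, `A2` and `x3` agree with the values given for the update method in the example.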


Many of the above ideas, together with some not mentioned here, have 
been incorporated into algorithms and programs for solving systems of 
nonlinear equations. These programs have been tested on a variety of 
suitably chosen test cases, and modifications, refinements, and improve- 
ments have been introduced in the light of the experience gained with these 
programs. While the current situation is far from ideal, the solution of a 
system of nonlinear equations is far from a hopeless task even when there is 
no good initial approximation to the solution. 


8.9 THE ZEROS OF POLYNOMIALS: 
THE PROBLEM 


In the remainder of this chapter, we shall be concerned with finding the 
complex and real roots of 


$$f(z) = a_n z^n + a_{n-1} z^{n-1} + \cdots + a_1 z + a_0 = 0 \tag{8.9-1}$$


where the coefficients $a_i$, $i = 0, \ldots, n$, are real numbers† and $z$ is a complex
variable. The methods of functional iteration discussed previously in this 
chapter can all be applied to finding the real roots of (8.9-1) and generally, 
with simple modifications, they can be applied to complex roots. However, 
the problem of finding the roots of (8.9-1) arises with such frequency that 
this alone justifies looking for methods particularly adapted to this problem. 
The particularly simple form of f(z) in fact greatly aids us in finding such 
special methods. Also the need to find complex as well as real roots and 
often the need to find all the roots of (8.9-1) add another dimension to the 
problem which merits special attention. 

The need to find all the roots of (8.9-1) commonly arises, as we have
seen, for example, in Sec. 5.4 in the consideration of stability problems.
Generally in such cases there is no good a priori information about the
location of all or, sometimes, any of the roots. This implies the need to
emphasize, more than we have done previously, methods which are always
convergent, particularly for high-degree polynomials. We note that the
methods of false position and bisection are not applicable to the case of
complex roots. Our approach will be to consider the basic ideas of the
classical methods, some of which may converge slowly but surely while
others converge rapidly if the initial point is sufficiently close to a root. We
shall then consider two modern algorithms which have been incorporated in
current polynomial root-finding programs and which have yielded high-quality
performance in speed and robustness.

† Many of the methods to be discussed are also applicable when the $a_i$’s are allowed to be
complex, but the case of real coefficients is by far the most important.

But before considering directly methods for the solution of (8.9-1), it is 
important to note that there is a large literature on the location of the zeros 
of a polynomial as a function of its coefficients. The problem of solving 
(8.9-1) cannot be properly attacked without some knowledge of the 
theorems on the location of the zeros of polynomials. A general discussion of 
these theorems is beyond our scope here; see Marden (1966), Wall (1948), 
and Wilf (1962). One aspect of the location of real roots is so important both 
for the problems of this chapter and in other areas (see Sec. 10.3-2) that we 
shall discuss it in some detail here. 


8.9-1 Sturm Sequences 


Definition 8.1 Let

$$f_1(x), f_2(x), \ldots, f_m(x) \tag{8.9-2}$$

be a sequence of polynomials. Such a sequence is called a Sturm sequence
on an interval $(a, b)$, where either $a$ or $b$ may be infinite, if (1) $f_m(x)$ does
not vanish in $(a, b)$; (2) at any zero of $f_k(x)$, $k = 2, \ldots, m - 1$, the two
adjacent functions are nonzero and have opposite signs; that is,

$$f_{k-1}(x) f_{k+1}(x) < 0 \tag{8.9-3}$$


Definition 8.2 Let $\{f_i(x)\}$, $i = 1, \ldots, m$, be a Sturm sequence on $(a, b)$,
and let $x_0$ be a point of $(a, b)$ at which $f_1(x_0) \ne 0$. We define $V(x_0)$ to be
the number of changes of sign of $\{f_i(x_0)\}$, zero values being ignored. If $a$
is finite, then $V(a)$ is defined as $V(a + \epsilon)$, where $\epsilon$ is such that no $f_i(x)$
vanishes in $(a, a + \epsilon)$, and similarly for $b$ when $b$ is finite. If $a = -\infty$,
then $V(a)$ is defined to be the number of changes of sign of
$\{\lim_{x \to -\infty} f_i(x)\}$, and similarly for $V(b)$ when $b = +\infty$.


Definition 8.3 Let $R(x)$ be any rational function. We define the Cauchy
index of $R(x)$ on $(a, b)$, denoted by $I_a^b R(x)$, to be the difference between
the number of jumps of $R(x)$ from $-\infty$ to $+\infty$ and the number of
jumps from $+\infty$ to $-\infty$ as $x$ goes from $a$ to $b$, excluding the endpoints.
That is, at every real pole of $R(x)$ in $(a, b)$ add 1 to the Cauchy index if
$R(x) \to -\infty$ on the left of the pole and $R(x) \to +\infty$ on the right of the
pole and subtract 1 if vice versa.
With these three definitions, we can prove the following theorem.


Theorem 8.2 (Sturm) If $f_i(x)$, $i = 1, \ldots, m$, is a Sturm sequence on an
interval $(a, b)$, then if neither $f_1(a)$ nor $f_1(b)$ equals 0,

$$I_a^b \frac{f_2(x)}{f_1(x)} = V(a) - V(b) \tag{8.9-4}$$


PROOF The value of $V(x)$ does not change when $x$ passes through a
zero of $f_k(x)$, $k = 2, \ldots, m$, because of (8.9-3). Thus $V(x)$ can change only
when $f_1(x)$ goes through 0. If $x_0$ is a zero of $f_1(x)$, it is not a zero of $f_2(x)$
because of property 2 of Sturm sequences. Therefore, $f_2(x)$ has the same
sign on both sides of $x_0$. If $x_0$ is a zero of $f_1(x)$ of even multiplicity, then
$V(x)$ does not change as $x$ increases through $x_0$ and there is no contribution
to the Cauchy index. If the zero is of odd multiplicity, then $V(x)$ will
increase by 1 if $f_1(x)$ and $f_2(x)$ have the same sign to the left of $x_0$ and will
decrease by 1 if the signs to the left are different. Correspondingly for
zeros of odd multiplicity, there is a $-1$ contribution to the Cauchy index
if the signs of $f_1(x)$ and $f_2(x)$ are the same to the left of $x_0$ and a $+1$
contribution if they are different. This establishes the theorem.


Our chief interest here is in applying this theorem to find the real roots
of (8.9-1) in an interval $(a, b)$. Consider the sequence of functions $f_i(x)$,
$i = 1, \ldots, m$, where

$$f_1(x) = f(x) \qquad f_2(x) = f'(x)$$
$$f_{j-1}(x) = q_{j-1}(x) f_j(x) - f_{j+1}(x) \qquad j = 2, \ldots, m-1 \tag{8.9-5}$$
$$f_{m-1}(x) = q_{m-1}(x) f_m(x)$$

where $q_{j-1}(x)$ is the quotient and $f_{j+1}(x)$ is the negative of the remainder
when $f_{j-1}(x)$ is divided by $f_j(x)$. Thus $\{f_i(x)\}$ is a sequence of polynomials of
decreasing degree which eventually must terminate in a polynomial $f_m(x)$,
$m \le n + 1$, which divides $f_{m-1}(x)$ (why?). The polynomial $f_m(x)$ is the
greatest common divisor of $f_1(x)$ and $f_2(x)$ and also of every other member of
the sequence (8.9-5). Now suppose $f_m(x)$ does not vanish in $(a, b)$ so that the
first condition of Definition 8.1 is satisfied. But in this case, the second
condition is also satisfied since, if $f_j(x) = 0$ for any $j$, $j = 2, \ldots, m - 1$, then
$f_{j-1}(x) = -f_{j+1}(x)$. Moreover, when $f_j(x) = 0$, $f_{j+1}(x) \ne 0$ since if it were 0,
$f_m(x)$ would also be 0 (why?). Thus the sequence $\{f_i(x)\}$ is a Sturm sequence
when $f_m(x)$ does not vanish in $(a, b)$.

If $f_m(x)$ is not of constant sign in $(a, b)$, then, in place of (8.9-5), we use
the sequence $\{f_i(x)/f_m(x)\}$, $i = 1, \ldots, m$. Not only is this a Sturm sequence but
also both sides of (8.9-4) are the same for this sequence and for the sequence
(8.9-5) (why?). Therefore, we can use these two sequences interchangeably in
applying Sturm’s theorem.
Now for the sequence (8.9-5) we write

$$\frac{f_2(x)}{f_1(x)} = \frac{f'(x)}{f(x)} = \sum_{j=1}^{p} \frac{n_j}{x - a_j} + R_1(x) \tag{8.9-6}$$

where the $a_j$, $j = 1, \ldots, p$, are the distinct real zeros of $f(x)$, $n_j$ is the
multiplicity of the zero $a_j$, and $R_1(x)$ has no poles on the real axis {43}. Since the
$n_j$ are all positive, $I_a^b[f'(x)/f(x)]$ is equal to the number of distinct real zeros
of $f(x)$ in the interval $(a, b)$. Therefore, we have the following theorem.


Theorem 8.3 The number of distinct real zeros of the polynomial $f(x)$ in
the interval $[a, b]$ is equal to $V(a) - V(b)$ if neither $f(a)$ nor $f(b)$ is equal
to 0. Moreover, if $f(a)$ or $f(b)$ or both are equal to 0 and the root is
simple, the result holds on $[a, b]$ if we define $V(x)$ to be the number of
changes of sign in $f_2(x), \ldots, f_m(x)$ when $f_1(x) = 0$ {44}.


This result can be extremely useful in locating the roots of (8.9-1). If, for
example, we are interested only in the real roots of (8.9-1), this theorem
enables us to determine exactly how many such roots there are. In fact, by
making use of $f_m(x)$, we can use this theorem to help find the multiplicity of
these roots {44}.


Example 8.10 Apply Sturm’s theorem to finding the number of real zeros of

$$f(x) = x^6 + 4x^5 + 4x^4 - x^2 - 4x - 4 = (x^2 + 1)(x^2 - 1)(x + 2)^2$$

Using (8.9-5), we calculate

$$f_1(x) = x^6 + 4x^5 + 4x^4 - x^2 - 4x - 4 \qquad f_2(x) = 6x^5 + 20x^4 + 16x^3 - 2x - 4$$
$$f_3(x) = 4x^4 + 8x^3 + 3x^2 + 14x + 16 \qquad f_4(x) = x^3 + 6x^2 + 12x + 8$$
$$f_5(x) = -17x^2 - 58x - 48 \qquad f_6(x) = -x - 2$$

where the coefficients have been made integers by multiplying by suitable positive
constants. For some sample values of $x$ the signs of the $f_i(x)$ are

                 -∞    +∞    0     -1    +1    -24/17
f_1(x)           +     +     -     0     0     +
f_2(x)           -     +     -     -     +     -
f_3(x)           +     +     +     +     +     -
f_4(x)           -     +     +     +     +     +
f_5(x)           -     -     -     -     -     0
f_6(x)           +     -     -     -     -     -
Number of
sign changes     4     1     2     2     1     3

Thus we have three distinct real zeros, two negative real zeros, and one positive real zero.
Although $-1$ and $+1$ are zeros, the rule above shows that there are two distinct zeros in
$(-\infty, -1]$ and three in $(-\infty, +1]$. The point $-24/17$ illustrates the case when an $f_i(x) = 0$.
For example, in $(-\infty, -24/17]$, there is one distinct zero. Since $f_6(x) = -x - 2$, the zero at
$-2$ is a double zero {44}.
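The counts in Example 8.10 can be checked with a small sketch that builds the sequence (8.9-5) in exact rational arithmetic and counts sign changes as in Definition 8.2. This is only an illustration: the helper names are hypothetical, the members are not rescaled to integer coefficients (which does not affect their signs), and $V(\pm\infty)$ is approximated by evaluating at $\pm 100$, which is assumed here to lie outside every zero of every member of this particular sequence.

```python
from fractions import Fraction


def poly_eval(p, x):
    """Evaluate p (coefficients, highest degree first) at x by Horner's rule."""
    r = Fraction(0)
    for c in p:
        r = r * x + c
    return r


def poly_rem(a, b):
    """Remainder of a divided by b, in exact rational arithmetic."""
    a = list(a)
    while len(a) >= len(b):
        factor = a[0] / b[0]
        for i in range(len(b)):
            a[i] -= factor * b[i]
        a.pop(0)            # leading coefficient is now exactly zero
    while a and a[0] == 0:  # strip any further leading zeros
        a.pop(0)
    return a


def sturm_sequence(f):
    """Sequence (8.9-5): f, f', then negated remainders until one divides."""
    f = [Fraction(c) for c in f]
    n = len(f) - 1
    df = [c * (n - i) for i, c in enumerate(f[:-1])]
    seq = [f, df]
    while True:
        r = poly_rem(seq[-2], seq[-1])
        if not r:           # last member divides the one before it
            return seq
        seq.append([-c for c in r])


def sign_changes(seq, x):
    """V(x): number of sign changes, zero values being ignored."""
    vals = [v for v in (poly_eval(p, x) for p in seq) if v != 0]
    return sum(1 for s, t in zip(vals, vals[1:]) if (s > 0) != (t > 0))


seq = sturm_sequence([1, 4, 4, 0, -1, -4, -4])   # x^6 + 4x^5 + 4x^4 - x^2 - 4x - 4


def V(x):
    return sign_changes(seq, Fraction(x))


distinct_real_zeros = V(-100) - V(100)
```

Exact `Fraction` arithmetic is used so that the test for a zero remainder is reliable; with floating-point coefficients that test would need a tolerance.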


8.10 CLASSICAL METHODS 


Before considering methods for finding zeros of a polynomial, we recall the
algorithm given by Eq. (7.2-4) for evaluating a polynomial. To evaluate a
polynomial $f(z)$ given by (8.9-1) at the point $z = z_j$, where $z_j$ may be
complex, we can use the recursion

$$b_{n-1} = a_n \qquad b_k = a_{k+1} + z_j b_{k+1} \qquad k = n-2, \ldots, -1 \tag{8.10-1}$$

with $f(z_j)$ given by $b_{-1}$. The coefficients $b_k$, $k = n - 1, \ldots, 0$, are the
coefficients resulting from the division of $f(z)$ by $z - z_j$, so that we have

$$f(z) = (z - z_j)(b_{n-1}z^{n-1} + b_{n-2}z^{n-2} + \cdots + b_0) + R_j \tag{8.10-2}$$

where $R_j = f(z_j) = b_{-1}$. This can readily be verified by equating the
coefficients of the same power of $z$ on both sides of (8.10-2) and comparing
with (8.10-1). The equation (8.10-1) is said to yield a synthetic-division
algorithm for dividing a polynomial by a linear factor.

The quotient in (8.10-2) can again be divided by $z - z_j$ to give

$$f(z) = (z - z_j)^2(c_{n-2}z^{n-2} + \cdots + c_0) + (z - z_j)R_j' + R_j \tag{8.10-3}$$

where the $c_k$ are given by the recursion

$$c_{n-2} = b_{n-1} \qquad c_k = b_{k+1} + z_j c_{k+1} \qquad k = n-3, \ldots, -1 \tag{8.10-4}$$

and $R_j' = c_{-1}$. Clearly $R_j' = f'(z_j)$.
This can be repeated, up to $n$ times, and at stage $s$ we have

$$f(z) = (z - z_j)^s p_{n-s}(z) + \sum_{l=0}^{s-1} (z - z_j)^l R_j^{(l)} \tag{8.10-5}$$

where $p_{n-s}(z)$ is a polynomial of degree $n - s$ and the successive remainders
$R_j^{(l)}$ are the reduced $l$th derivatives at $z = z_j$,

$$R_j^{(l)} = \frac{f^{(l)}(z_j)}{l!} \tag{8.10-6}$$

Thus, by $s + 1$ repetitions of the synthetic-division algorithm, we can
compute the value of $f(z)$ and its first $s$ reduced derivatives at the point
$z = z_j$, using a total of

$$n + (n - 1) + \cdots + (n - s) = \frac{(s + 1)(2n - s)}{2}$$

multiplications and additions. However, we should note that there exists an
algorithm {47} which requires only $2n - 1$ multiplications, $s$ divisions, and
$(s + 1)(2n - s)/2$ additions, which is a considerable saving if higher derivatives
are required.
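Repeated synthetic division, as in (8.10-1) to (8.10-6), can be sketched as follows. This is a plain illustration of the straightforward scheme, not the more economical algorithm mentioned above; the function name is hypothetical.

```python
def reduced_derivatives(a, z, s):
    """Return [f(z), f'(z)/1!, ..., f^(s)(z)/s!] by s + 1 synthetic divisions.

    a lists the coefficients a_n, ..., a_0 (highest degree first).
    """
    coeffs = list(a)
    out = []
    for _ in range(s + 1):
        b = [coeffs[0]]
        for c in coeffs[1:]:
            b.append(c + z * b[-1])     # b_k = a_{k+1} + z * b_{k+1}
        out.append(b[-1])               # remainder of this division
        coeffs = b[:-1]                 # quotient feeds the next division
        if not coeffs:                  # nothing left to divide
            break
    return out


values = reduced_derivatives([1, -5, -17, 21], 2, 3)   # z^3 - 5z^2 - 17z + 21 at z = 2
```

For this cubic, $f(2) = -25$, $f'(2) = -25$, $f''(2)/2! = 1$ and $f'''(2)/3! = 1$. The same routine accepts a complex `z`.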

Now, if $z_j$ is a zero of $f(z)$, $R_j = 0$ and the resulting quotient in (8.10-2) is
the deflated polynomial of degree $n - 1$, that is, the polynomial whose zeros
are identical with the remaining $n - 1$ zeros of $f(z)$. Thus, once we find a
zero $z_j$ of $f(z)$ by any method, synthetic division by $z - z_j$ yields the deflated
polynomial of degree $n - 1$. To find additional zeros of (8.9-1), the root-finding
method can then be applied to this new polynomial. However, it is
advisable to check first whether $z_j$ is a multiple zero by applying algorithm
(8.10-1) to the deflated polynomial.

Now, if $f(z)$ is a real polynomial and we have found a complex zero
$z_j = x_j + iy_j$, we know that $\bar{z}_j = x_j - iy_j$ is also a zero, so that

$$(z - z_j)(z - \bar{z}_j) = z^2 + p_j z + q_j \qquad p_j = -2x_j \qquad q_j = x_j^2 + y_j^2 \tag{8.10-7}$$

is a real factor of $f(z)$. Hence, since we are interested in performing all our
arithmetic in the real domain, it is of interest to develop an algorithm for
synthetic division by a quadratic factor. We then have

$$f(z) = (z^2 + p_j z + q_j)(b_{n-2}z^{n-2} + \cdots + b_0) + b_{-1}(z + p_j) + b_{-2} \tag{8.10-8}$$

where the $b_k$ are given by the recursion

$$b_{n-2} = a_n \qquad b_{n-3} = a_{n-1} - p_j b_{n-2}$$
$$b_k = a_{k+2} - p_j b_{k+1} - q_j b_{k+2} \qquad k = n-4, \ldots, 0, -1, -2 \tag{8.10-9}$$

The remainder can also be written in the form $R_j z + S_j$, with

$$R_j = a_1 - p_j b_0 - q_j b_1 = b_{-1} \qquad S_j = a_0 - q_j b_0 = b_{-2} + p_j b_{-1} \tag{8.10-10}$$

and if $z^2 + p_j z + q_j$ is a factor of $f(z)$, then $R_j = S_j = 0$. In this case, the
quotient polynomial in (8.10-8) is the deflated polynomial of degree $n - 2$.
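Synthetic division by a quadratic factor, following (8.10-9) and (8.10-10), can be sketched as below. The function name is hypothetical, and the polynomial is assumed to have degree at least 2.

```python
def divide_by_quadratic(a, p, q):
    """Divide f by z^2 + p z + q using the recursion (8.10-9).

    a lists a_n, ..., a_0. Returns (quotient, R, S), where quotient lists
    b_{n-2}, ..., b_0 and the remainder is R*z + S as in (8.10-10).
    """
    b = []
    for c in a:
        b1 = b[-1] if len(b) >= 1 else 0
        b2 = b[-2] if len(b) >= 2 else 0
        b.append(c - p * b1 - q * b2)   # b_k = a_{k+2} - p b_{k+1} - q b_{k+2}
    b_m1, b_m2 = b[-2], b[-1]           # b_{-1} and b_{-2}
    return b[:-2], b_m1, b_m2 + p * b_m1


# (z^2 + 2z + 5)(z - 3) = z^3 - z^2 - z - 15, so the remainder should vanish.
quotient, R, S = divide_by_quadratic([1, -1, -1, -15], 2, 5)
```

When `R` and `S` are both zero, `quotient` holds the coefficients of the deflated polynomial of degree $n - 2$.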

We should point out that deflation can be an unstable process unless 
proper care is taken [see Wilkinson (1963), pp. 55-62]. The simplest way to 
avoid such instability is to evaluate the zeros of f(z) in ascending order of 
magnitude, in which case deflation is quite stable. Alternatively, when the 
zeros are evaluated in descending order of magnitude, a different deflation 
algorithm exists, which is also stable {51}. In any case, it is advisable, once 
approximations to all zeros of f(z) have been computed, to use these approx- 
imations as starting values of an iterative process such as Newton’s method 
using the original polynomial. This is called purifying the zeros. Usually, one 
iteration will suffice to obtain the desired accuracy. 
As we shall see in Sec. 8.12, Newton’s method forms the basis of an
efficient algorithm for finding all the zeros of a polynomial. However, not
only Newton’s method but all iterative methods discussed in the first half of
this chapter, except those based on change of sign, can be used to find the
complex zeros of polynomials. While the derivation of the order of a method
in this case is beyond the scope of this book, nevertheless it can be shown, for
example, that, as with real roots, Newton’s method is of second order near a
simple complex zero while the secant method is of order $(1 + \sqrt{5})/2$. The
main disadvantage of these iterative methods is that they require a good
initial approximation. When one is available, as in the situation described
above, these methods are quite good.

Most of these methods for calculating complex zeros will not converge
to complex roots unless we take complex values as initial approximations.
However, Muller’s method {21} has the desirable feature that it can generate
a complex approximation starting with real ones. Thus, the search for the
zeros is not influenced by the prejudices of the user.

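If we have a real polynomial and a complex approximation $z_j = x_j + iy_j$, then the
algorithm for computing $f(z_j)$ can be expressed using real arithmetic only as
follows. Let

$$b_k = \gamma_k + i\delta_k \qquad c_k = \epsilon_k + i\eta_k \qquad i = \sqrt{-1} \tag{8.10-11}$$

Then (8.10-1) and (8.10-4) become

$$\begin{aligned}
\gamma_k &= a_{k+1} + x_j\gamma_{k+1} - y_j\delta_{k+1} & k &= n-1, \ldots, 0, -1 & \gamma_n &= \delta_n = 0 \\
\delta_k &= x_j\delta_{k+1} + y_j\gamma_{k+1} & k &= n-2, \ldots, 0, -1 & \delta_{n-1} &= 0 \\
\epsilon_k &= \gamma_{k+1} + x_j\epsilon_{k+1} - y_j\eta_{k+1} & k &= n-2, \ldots, 0, -1 & \epsilon_{n-1} &= \eta_{n-1} = 0 \\
\eta_k &= \delta_{k+1} + x_j\eta_{k+1} + y_j\epsilon_{k+1} & k &= n-3, \ldots, 0, -1 & \eta_{n-2} &= 0
\end{aligned} \tag{8.10-12}$$

with

$$R_j = b_{-1} = \gamma_{-1} + i\delta_{-1} \qquad R_j' = c_{-1} = \epsilon_{-1} + i\eta_{-1} \tag{8.10-13}$$

Thus Newton’s method takes the form

$$x_{j+1} = x_j - \frac{\gamma_{-1}\epsilon_{-1} + \delta_{-1}\eta_{-1}}{\epsilon_{-1}^2 + \eta_{-1}^2} \qquad y_{j+1} = y_j - \frac{\delta_{-1}\epsilon_{-1} - \gamma_{-1}\eta_{-1}}{\epsilon_{-1}^2 + \eta_{-1}^2} \tag{8.10-14}$$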
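The real-arithmetic form of Newton’s method can be sketched as a single loop that carries the real and imaginary parts of $b_k$ and $c_k$ together. This is a minimal illustration under the notation of (8.10-11) to (8.10-13); the function name and the test polynomial $z^2 + 1$ are hypothetical choices.

```python
def newton_step_real(a, x, y):
    """One Newton step at z = x + i*y for a real polynomial a = [a_n, ..., a_0].

    Only real arithmetic is used: (g, d) are the real and imaginary parts
    of b_k, and (e, h) those of c_k.
    """
    g = d = 0.0
    e = h = 0.0
    for coef in a:
        # c_k = b_{k+1} + z * c_{k+1} needs the previous b, so update c first.
        e, h = g + x * e - y * h, d + x * h + y * e
        g, d = coef + x * g - y * d, x * d + y * g
    den = e * e + h * h                 # |f'(z)|^2
    return x - (g * e + d * h) / den, y - (d * e - g * h) / den


x, y = newton_step_real([1.0, 0.0, 1.0], 0.5, 0.5)   # one step for z^2 + 1
first_step = (x, y)
for _ in range(30):
    x, y = newton_step_real([1.0, 0.0, 1.0], x, y)
```

Starting from $0.5 + 0.5i$, the iterates approach the zero $z = i$; starting from a real value would keep every iterate real, in line with the remark above about complex starting values.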
8.10-1 Bairstow’s Method


In the case of a real polynomial, we know that the complex zeros occur as 
conjugate pairs. Hence, instead of looking for one zero at a time, we may look 
for pairs of zeros which generate a real quadratic factor. This is the basic 
idea of the iteration of Bairstow, which assumes a good initial approximation. 

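When $p$ and $q$ are used instead of $p_j$ and $q_j$, Eqs. (8.10-10) can be written

$$R(p, q) = a_1 - pb_0 - qb_1 = 0 \qquad S(p, q) = a_0 - qb_0 = 0 \tag{8.10-15}$$

where $b_1$ and $b_0$ are also functions of $p$ and $q$. In the Bairstow iteration, these
two simultaneous equations for the two unknowns $p$ and $q$ are solved using
the Newton-Raphson method for simultaneous equations (8.8-12). Let $p_i$
and $q_i$ and $p_{i+1}$ and $q_{i+1}$ denote, respectively, the results of the $i$th and
$(i + 1)$st steps in the iteration. Then from (8.8-12) we have

$$p_{i+1} = p_i - \frac{1}{D}\left[R\frac{\partial S}{\partial q} - S\frac{\partial R}{\partial q}\right] \qquad q_{i+1} = q_i - \frac{1}{D}\left[S\frac{\partial R}{\partial p} - R\frac{\partial S}{\partial p}\right] \tag{8.10-16}$$

where

$$D = \begin{vmatrix} \dfrac{\partial R}{\partial p} & \dfrac{\partial S}{\partial p} \\[2mm] \dfrac{\partial R}{\partial q} & \dfrac{\partial S}{\partial q} \end{vmatrix}_{p = p_i,\; q = q_i} \tag{8.10-17}$$

Now using (8.10-10), we can write

$$\frac{\partial R}{\partial p} = -p\frac{\partial b_0}{\partial p} - q\frac{\partial b_1}{\partial p} - b_0 \qquad \frac{\partial R}{\partial q} = -p\frac{\partial b_0}{\partial q} - q\frac{\partial b_1}{\partial q} - b_1$$
$$\frac{\partial S}{\partial p} = -q\frac{\partial b_0}{\partial p} \qquad \frac{\partial S}{\partial q} = \frac{\partial b_{-2}}{\partial q} + p\frac{\partial b_{-1}}{\partial q} \tag{8.10-18}$$

From (8.10-9)

$$\frac{\partial b_k}{\partial p} = -b_{k+1} - p\frac{\partial b_{k+1}}{\partial p} - q\frac{\partial b_{k+2}}{\partial p} \qquad k = n-3, \ldots, 0, -1 \qquad \frac{\partial b_{n-2}}{\partial p} = \frac{\partial b_{n-1}}{\partial p} = 0 \tag{8.10-19}$$

$$\frac{\partial b_k}{\partial q} = -b_{k+2} - p\frac{\partial b_{k+1}}{\partial q} - q\frac{\partial b_{k+2}}{\partial q} \qquad k = n-4, \ldots, 0, -1, -2 \qquad \frac{\partial b_{n-3}}{\partial q} = \frac{\partial b_{n-2}}{\partial q} = 0 \tag{8.10-20}$$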
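If we define $d_k$ by the recurrence relation

$$d_k = -b_{k+1} - pd_{k+1} - qd_{k+2} \qquad k = n-3, \ldots, 0, -1 \qquad d_{n-2} = d_{n-1} = 0 \tag{8.10-21}$$

then it follows from (8.10-19) and (8.10-20) that

$$\frac{\partial b_k}{\partial p} = \frac{\partial b_{k-1}}{\partial q} = d_k \qquad k = n-3, \ldots, 0, -1 \tag{8.10-22}$$

and finally that

$$\frac{\partial R}{\partial p} = d_{-1} \qquad \frac{\partial R}{\partial q} = d_0 \qquad \frac{\partial S}{\partial p} = -qd_0 \qquad \frac{\partial S}{\partial q} = d_{-1} + pd_0 \tag{8.10-23}$$

Therefore, (8.10-16) and (8.10-17) become

$$p_{i+1} = p_i - \frac{1}{D}[b_{-1}(d_{-1} + p_i d_0) - (b_{-2} + p_i b_{-1})d_0]$$
$$q_{i+1} = q_i - \frac{1}{D}[(b_{-2} + p_i b_{-1})d_{-1} + d_0 b_{-1} q_i] \tag{8.10-24}$$

where

$$D = d_{-1}^2 + p_i d_0 d_{-1} + q_i d_0^2 \tag{8.10-25}$$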

Example 8.11 Use the Bairstow iteration to find a quadratic factor of $z^3 - z - 1$ starting
with $p_1 = q_1 = 1$.
We arrange the calculation in columns of $a_{k+2}$, $b_k$, and $d_k$ for each pair $(p_i, q_i)$,
with (8.10-9) and (8.10-21) used to calculate the last two columns. For this problem we
have, using (8.10-24) and (8.10-25),

                  (p_1, q_1) = (1, 1)     (p_2, q_2) = (4/3, 2/3)
  k    a_{k+2}      b_k      d_k             b_k       d_k
  1     1            1        0               1         0
  0     0           -1       -1              -4/3      -1
 -1    -1           -1        2               1/9       8/3
 -2    -1            1                       -7/27

and finally $p_3 = 1.3246$, $q_3 = .7544$, whereas the true values are $p = 1.3247$, $q = .7549$.
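The two iterations of Example 8.11 can be retraced with a sketch of one Bairstow step built directly from (8.10-9), (8.10-21), (8.10-24), and (8.10-25). The function name is hypothetical, the polynomial is assumed to have degree at least 2, and exact fractions are used only so that the intermediate values can be compared digit for digit.

```python
from fractions import Fraction


def bairstow_step(a, p, q):
    """One Bairstow iteration for f with coefficients a = [a_n, ..., a_0]."""
    n = len(a) - 1
    coef = {j: a[n - j] for j in range(n + 1)}      # coef[j] = a_j
    b = {n - 1: 0, n: 0}                            # b_k = 0 for k > n - 2
    d = {n - 2: 0, n - 1: 0}                        # starting values in (8.10-21)
    for k in range(n - 2, -3, -1):                  # recursion (8.10-9)
        b[k] = coef[k + 2] - p * b[k + 1] - q * b[k + 2]
    for k in range(n - 3, -2, -1):                  # recursion (8.10-21)
        d[k] = -b[k + 1] - p * d[k + 1] - q * d[k + 2]
    D = d[-1] ** 2 + p * d[0] * d[-1] + q * d[0] ** 2            # (8.10-25)
    s = b[-2] + p * b[-1]
    p_new = p - (b[-1] * (d[-1] + p * d[0]) - s * d[0]) / D      # (8.10-24)
    q_new = q - (s * d[-1] + d[0] * b[-1] * q) / D
    return p_new, q_new


a = [1, 0, -1, -1]                                  # z^3 - z - 1
p2, q2 = bairstow_step(a, Fraction(1), Fraction(1))
p3, q3 = bairstow_step(a, p2, q2)
```

Here `p2, q2` come out as 4/3 and 2/3, and `p3, q3` as 151/114 and 43/57, which round to the values 1.3246 and .7544 quoted in the example. In floating-point use one would iterate until the changes in `p` and `q` fall below a tolerance.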
When it converges, Bairstow’s method has the characteristic rapid
convergence of the Newton-Raphson method. But, as is usually the case in
solving simultaneous nonlinear equations, convergence requires quite a
good initial approximation {53}.


8.10-2 Graeffe’s Root-squaring Method 


The essence of Graeffe’s method is to replace (8.9-1) by an equation, still of
degree $n$, whose roots are the squares of the roots of (8.9-1). By iterating this
procedure, roots of (8.9-1) which are unequal in magnitude become more
widely separated in magnitude. By separating the roots sufficiently we can,
as we shall see, calculate the roots directly from the coefficients. When there
are roots of equal magnitude, this process runs into difficulties, but they can
be overcome.

Let the roots of (8.9-1) be $\alpha_i$, $i = 1, \ldots, n$. We assume in the remainder of
this section that $a_n = 1$. Then, writing $f_0(z)$ for $f(z)$, we have

$$f_0(z) = (z - \alpha_1)(z - \alpha_2) \cdots (z - \alpha_n) \tag{8.10-26}$$

Using this, we can write

$$f_1(w) = (-1)^n f_0(z) f_0(-z) = (w - \alpha_1^2)(w - \alpha_2^2) \cdots (w - \alpha_n^2) \qquad w = z^2 \tag{8.10-27}$$

so that the zeros of $f_1(w)$ are the squares of those of $f_0(z)$. Therefore, the
sequence

$$f_{r+1}(w) = (-1)^n f_r(z) f_r(-z) \qquad r = 0, 1, \ldots \tag{8.10-28}$$

is such that the zeros of each polynomial are the squares of the zeros of the
previous polynomial. If we denote the coefficients of $f_r(z)$ by $a_j^{(r)}$, $j = 0, \ldots, n$,
we can show that {55}

$$a_j^{(r+1)} = (-1)^{n-j}\left[(a_j^{(r)})^2 + 2\sum_{k=1}^{\min(n-j,\,j)} (-1)^k a_{j-k}^{(r)} a_{j+k}^{(r)}\right] \tag{8.10-29}$$

To use the sequence of polynomials $\{f_r(z)\}$, we need the well-known
relationship between the coefficients of a polynomial and its zeros. This
relationship is expressed by the equation

$$a_j^{(r)} = (-1)^{n-j} S_{n-j}(\alpha_1^{2^r}, \alpha_2^{2^r}, \ldots, \alpha_n^{2^r}) \qquad j = 0, \ldots, n-1 \tag{8.10-30}$$

where $S_k(x_1, \ldots, x_n)$ is the $k$th symmetric function of $x_1, \ldots, x_n$. This function
is defined by the equation

$$S_k(x_1, \ldots, x_n) = \sum_c x_{r_1} x_{r_2} \cdots x_{r_k} \tag{8.10-31}$$
i 
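where the notation $\sum_c$ denotes that the sum is over all combinations of $k$ out
of the digits 1 to $n$ in the subscripts. Thus, for example,

$$a_{n-1}^{(r)} = -S_1(\alpha_1^{2^r}, \ldots, \alpha_n^{2^r}) = -\sum_{k=1}^{n} \alpha_k^{2^r} \tag{8.10-32}$$

Let

$$\alpha_k = \rho_k e^{i\phi_k} \qquad k = 1, \ldots, n \tag{8.10-33}$$

Suppose first that all the roots are distinct in magnitude and ordered so that

$$\rho_1 > \rho_2 > \cdots > \rho_n \tag{8.10-34}$$

We write (8.10-32) as

$$a_{n-1}^{(r)} = -\alpha_1^{2^r}\left[1 + \sum_{k=2}^{n}\left(\frac{\alpha_k}{\alpha_1}\right)^{2^r}\right] \tag{8.10-35}$$

Then using (8.10-34), we have

$$\lim_{r \to \infty} |a_{n-1}^{(r)}|^{1/2^r} = |\alpha_1| \tag{8.10-36}$$

Therefore, for sufficiently large $r$

$$\rho_1 \approx |a_{n-1}^{(r)}|^{1/2^r} \tag{8.10-37}$$

Similarly, we have

$$a_{n-2}^{(r)} = \sum_c \alpha_{r_1}^{2^r}\alpha_{r_2}^{2^r} = \alpha_1^{2^r}\alpha_2^{2^r}\left[1 + \sum_{(r_1, r_2) \ne (1, 2)} \frac{\alpha_{r_1}^{2^r}\alpha_{r_2}^{2^r}}{\alpha_1^{2^r}\alpha_2^{2^r}}\right] \tag{8.10-38}$$

and, therefore, for sufficiently large $r$

$$\rho_2 \approx \frac{1}{\rho_1}|a_{n-2}^{(r)}|^{1/2^r} \approx \left|\frac{a_{n-2}^{(r)}}{a_{n-1}^{(r)}}\right|^{1/2^r} \tag{8.10-39}$$

Continuing in this way we have in general

$$\rho_k \approx \left|\frac{a_{n-k}^{(r)}}{a_{n-k+1}^{(r)}}\right|^{1/2^r} \tag{8.10-40}$$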


In practice “sufficiently large $r$” means only that we must continue the
root-squaring process until the approximations to the magnitudes have
stabilized to the number of decimal places we want.

When the roots are all separated, then once we have the magnitudes,
determining the sign is easily accomplished by inserting the magnitude into
(8.9-1).
Example 8.12 Use the root-squaring method to find the zeros of $z^3 - 5z^2 - 17z + 21$.
Using (8.10-29), we calculate

r    a_3^{(r)}   a_2^{(r)}      a_1^{(r)}          a_0^{(r)}
1    1           -59            499                -441
2    1           -2,483         196,963            -194,481
3    1           -5,771,363     37,828,630,723     -37,822,859,361

and from these we can estimate at each stage

r    ρ_1      ρ_2      ρ_3
1    7.68     2.91     .94
2    7.06     2.98     .997
3    7.001    2.999    .99998

Insertion of these three magnitudes into the polynomial leads easily to the result that $\alpha_1$
and $\alpha_3$ are positive and $\alpha_2$ is negative. The true roots are 7, $-3$, 1.
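The tables of Example 8.12 can be regenerated with a short sketch of the squaring formula (8.10-29) and the magnitude estimates (8.10-40). The function names are hypothetical; coefficients are stored in ascending order so that `a[j]` is $a_j$, and Python integers keep the rapidly growing coefficients exact.

```python
def graeffe_step(a):
    """Coefficients of the polynomial whose zeros are the squares of those of a.

    a[j] is the coefficient of z**j (a[n] is the leading coefficient),
    and the update follows (8.10-29).
    """
    n = len(a) - 1
    new = []
    for j in range(n + 1):
        total = a[j] * a[j]
        for k in range(1, min(n - j, j) + 1):
            total += 2 * (-1) ** k * a[j - k] * a[j + k]
        new.append((-1) ** (n - j) * total)
    return new


def magnitudes(a, r):
    """Estimates of rho_1, ..., rho_n from (8.10-40) after r squarings."""
    n = len(a) - 1
    return [abs(a[n - k] / a[n - k + 1]) ** (1.0 / 2 ** r) for k in range(1, n + 1)]


a0 = [21, -17, -5, 1]            # z^3 - 5z^2 - 17z + 21, ascending order
a1 = graeffe_step(a0)
a2 = graeffe_step(a1)
a3 = graeffe_step(a2)
rho = magnitudes(a3, 3)
```

After three squarings the estimates in `rho` are already close to 7, 3, and 1; the signs of the roots still have to be found separately, as described above.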


The difficulties in the use of the root-squaring procedure arise when
some of the roots have equal magnitudes. These difficulties are of two kinds:
(1) The relations (8.10-37), (8.10-39), and (8.10-40) no longer are correct in
general. Therefore, determining the magnitudes of the roots is more difficult.
(2) Since some roots may be complex, it is no longer simple to determine the
root given the magnitude. While both these difficulties are not hard to
overcome, we shall not go into the details here. Similarly, we shall not
discuss the computational aspects of this method since we are interested
principally in the ideas behind the method of this section, not in its
implementation.


8.10-3 Bernoulli’s Method 


Consider the difference equation

$$a_n u_k + a_{n-1}u_{k-1} + \cdots + a_0 u_{k-n} = 0 \tag{8.10-41}$$

where the coefficients $a_i$, $i = 0, \ldots, n$, are those of (8.9-1). If the roots $\alpha_i$ of
(8.9-1) are distinct, then the solution of this equation is given by {57}

$$u_k = \sum_{i=1}^{n} c_i\alpha_i^k \tag{8.10-42}$$

where the $c_i$’s depend on the initial conditions used to solve (8.10-41). If the
roots are ordered in magnitude as in (8.10-34), then by rewriting (8.10-42) as

$$u_k = c_1\alpha_1^k\left[1 + \sum_{i=2}^{n}\frac{c_i}{c_1}\left(\frac{\alpha_i}{\alpha_1}\right)^k\right] \tag{8.10-43}$$

we have, if $c_1 \ne 0$,

$$\lim_{k \to \infty}\frac{u_k}{u_{k-1}} = \alpha_1 \tag{8.10-44}$$

The essence of Bernoulli’s method is to use (8.10-41) to compute successive
values of $u_k$ and then to compute the ratio of successive values of $u_k$ until
these ratios converge to $\alpha_1$.

For this method to work at all, it is necessary that $c_1 \ne 0$. The $c_i$’s
depend, as we said, on the $n$ initial conditions required by (8.10-41). If we
generate these initial values using the equation

$$a_n u_m + a_{n-1}u_{m-1} + \cdots + a_{n-m+1}u_1 + ma_{n-m} = 0 \qquad m = 1, \ldots, n \tag{8.10-45}$$

then it can be shown {56} that all $c_i$’s are unity and thus

$$u_k = \sum_{i=1}^{n}\alpha_i^k \tag{8.10-46}$$

Therefore, (8.10-44) always holds for this choice of initial conditions.

The above was predicated on the assumption that $\alpha_1$, the root of largest
magnitude, is real and distinct. Nevertheless, (8.10-44) also holds if $\alpha_1$ is
multiple but real {57}. But when the root of largest magnitude is complex, or
when there is some combination of real and complex roots of largest magnitude,
(8.10-44) no longer holds. The number of possible special cases is
therefore very large. Each such special case can be taken care of by a suitable
modification of (8.10-44), but especially for automatic computation, it is
extremely tedious to have to provide for all these cases. Moreover, if $\alpha_2$ has
nearly the same magnitude as $\alpha_1$, the convergence of the process is extremely
slow (why?). Thus, as a general-purpose method, Bernoulli’s method has
little in its favor. However, when the root of largest or smallest magnitude
[by considering $f(1/z)$] is the only one that is desired and is distinct,
Bernoulli’s method can be very useful.


Example 8.13 Use Bernoulli’s method to find the zero of greatest magnitude of the
polynomial $z^3 - 5z^2 - 17z + 21$ of Example 8.12.
Using (8.10-45), we have

$$u_1 - 5 = 0 \;\rightarrow\; u_1 = 5$$
$$u_2 - 5u_1 - 34 = 0 \;\rightarrow\; u_2 = 59$$
$$u_3 - 5u_2 - 17u_1 + 63 = 0 \;\rightarrow\; u_3 = 317$$

Then using (8.10-41) in the form

$$u_k = 5u_{k-1} + 17u_{k-2} - 21u_{k-3}$$

we calculate

k    u_k           u_k/u_{k-1}
4    2,483         7.83
5    16,565        6.67
6    118,379       7.15
7    821,357       6.94
8    5,771,363     7.03
9    40,333,925    6.989

This example illustrates two characteristic features of Bernoulli’s
method. First, it generally converges more slowly than the root-squaring
method. However, because of the linear convergence of the method [note
(8.10-44)], the $\delta^2$ process can be used to accelerate the convergence. For
example, when the first three approximations to the zero in the example above
(7.83, 6.67, 7.15) are used, Eq. (8.7-4) gives 6.998 as the improved approximation.
The second characteristic feature of Bernoulli’s method is that, like the
root-squaring method, the numbers involved grow very rapidly for roots of
magnitude greater than 1. This latter problem can be avoided by working
with the ratios $u_{k-1}/u_k$ {58}.
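The sequence of Example 8.13 can be generated with a sketch that uses (8.10-45) for the first $n$ values and (8.10-41) afterwards. The function name is hypothetical, and no attempt is made here to rescale the values or to apply the $\delta^2$ process.

```python
def bernoulli_sequence(a, count):
    """Return u_1, ..., u_count for the polynomial a = [a_n, ..., a_0].

    The first n values come from (8.10-45); later ones from (8.10-41).
    """
    n = len(a) - 1
    u = []
    for m in range(1, count + 1):
        # Sum of a_{n-i} * u_{m-i} over the values already computed.
        acc = sum(a[i] * u[m - i - 1] for i in range(1, min(m - 1, n) + 1))
        if m <= n:
            acc += m * a[m]             # the term m * a_{n-m} of (8.10-45)
        u.append(-acc / a[0])
    return u


u = bernoulli_sequence([1, -5, -17, 21], 9)
ratios = [u[k] / u[k - 1] for k in range(3, 9)]   # u_4/u_3, ..., u_9/u_8
```

The ratios oscillate about the dominant zero 7 and approach it slowly, which is the linear convergence noted above.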


8.10-4 Laguerre’s Method 
Suppose all the zeros of the polynomial $f(z)$ are real and ordered such that

$$\alpha_1 \le \alpha_2 \le \cdots \le \alpha_n \qquad n > 2 \tag{8.10-47}$$

with $\alpha_1 < \alpha_n$. We define

$$I_i = (\alpha_i, \alpha_{i+1}] \qquad i = 0, \ldots, n \qquad \alpha_0 = -\infty \qquad \alpha_{n+1} = +\infty \tag{8.10-48}$$

Let $x$ be an approximation to a zero of $f(z)$. This approximation lies in some
interval $I_i$. The essence of Laguerre’s method is to construct a parabola with
two real zeros in $I_i$ at least one of which will be closer to a zero of $f(z)$ than $x$.
It turns out that there are many such parabolas, depending on a real parameter
$\lambda$, and we shall so choose $\lambda$ that one of the zeros of the resulting
parabola is as close as possible to a zero of $f(z)$.

For an arbitrary real $\lambda$, let

$$S(\lambda) = \sum_{i=1}^{n}\left(\frac{\lambda - \alpha_i}{x - \alpha_i}\right)^2 > 0 \tag{8.10-49}$$

The equation

$$\phi(y) = (x - y)^2 S(\lambda) - (\lambda - y)^2 = 0 \tag{8.10-50}$$

has two real roots, $y_1$, $y_2$, which are distinct if $\lambda \ne x$, which we shall
henceforth assume. If $f(x) \ne 0$, it follows from (8.10-49) and (8.10-50) that
$\phi(x) < 0$ and that $\phi(\alpha_i) > 0$, $i = 0, 1, \ldots, n + 1$. Therefore, if $x \in I_i$, $i = 0, \ldots, n$,
the two roots $y_1$, $y_2$ both lie in $I_i$, one between $\alpha_i$ and $x$ and one between $x$
and $\alpha_{i+1}$. Now, apparently, to know $\phi(y)$ as a function of $\lambda$ requires a
knowledge of the roots $\alpha_i$. However, we shall show [cf. (8.10-54)] that this
seeming dependence on the roots can be eliminated.

We now wish to choose $\lambda$ so that one of the zeros of $\phi(y)$ is as close as
possible to a zero of $f(z)$. This means that we wish to maximize $|x - y|$ as a
function of $\lambda$ or alternatively as a function of the real parameter $\mu = \lambda - x$
(why?). From (8.9-1), we can easily calculate {60}


fe) & 1 | 
f(x) 4xca, (8.10-51) 
CoP Sie) _ | 
foo? 2 Gray =% (8.10-52) 
Ama)? wy 2H 
Also (? — ) —&—a,) + x a, + 1 (8.10-53) 


Using (8.10-49) and (8.10-51) to (8.10-53), we see that (8.10-50) becomes

$$\mu^2 (\eta^2 S_2 - 1) + 2\mu\eta (\eta S_1 - 1) + (n - 1)\eta^2 = 0 \tag{8.10-54}$$

where $\eta = x - y$. From this equation we see that knowledge of the $\alpha_i$ is not
needed in order to solve $\phi(y) = 0$ for its roots.
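
The algebra behind (8.10-54) is short enough to write out. The lines below are a sketch of that step, using only the definitions of $S_1$, $S_2$, $\mu = \lambda - x$, and $\eta = x - y$ given above.

```latex
% Sum (8.10-53) over i = 1, ..., n and use (8.10-51), (8.10-52):
S(\lambda) = \mu^2 S_2 + 2\mu S_1 + n .
% Since \lambda - y = (\lambda - x) + (x - y) = \mu + \eta, (8.10-50) reads
\eta^2 \left( \mu^2 S_2 + 2\mu S_1 + n \right) - (\mu + \eta)^2 = 0 ,
% and collecting powers of \mu gives (8.10-54):
\mu^2 (\eta^2 S_2 - 1) + 2\mu\eta (\eta S_1 - 1) + (n - 1)\eta^2 = 0 .
```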

Now (8.10-54) is also a quadratic equation in $\mu$ whose roots are continuous
functions of the parameter $\eta$. Since $\mu = \lambda - x$ takes on only real values
in our discussion, our aim is to find the maximum value of $|\eta|$ such that $\mu$ is
real. Since this implies that greater values of $|\eta|$ will result in complex
values for $\mu$, that is, complex roots of (8.10-54), the desired value of $|\eta|$ must
be such that the discriminant D of the quadratic equation in $\mu$ is 0 (why?).
This means that it is only necessary to determine the values of $\eta$ which make
D = 0. Thus, $\mu$ or, what amounts to the same thing, $\lambda$ need not be
determined at all.

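From (8.10-54) we find that

$$D = \eta^2 \{ [S_1^2 - (n - 1) S_2] \eta^2 - 2\eta S_1 + n \} \tag{8.10-55}$$

Solving D = 0 for $\eta$, we get

$$\eta = \frac{n}{S_1 \pm \sqrt{(n - 1)(n S_2 - S_1^2)}} \tag{8.10-56}$$

which yields the equation

$$y = x - \frac{n f(x)}{f'(x) \pm \sqrt{H(x)}} \tag{8.10-57}$$

where

$$H(x) = (n - 1)\{ (n - 1)[f'(x)]^2 - n f(x) f''(x) \} \tag{8.10-58}$$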




When all the zeros are real, we can show that H is always nonnegative 
{60}. Equation (8.10-57) suggests the iteration 


$$x_{i+1} = x_i - \frac{n f(x_i)}{f'(x_i) \pm \sqrt{H(x_i)}} \tag{8.10-59}$$

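In order to determine which sign to use in the denominator of (8.10-59), let
us use Theorem 8.1 to determine the order of this iteration. We have, with $\alpha$
any simple real zero of f(z) {60},

$$F(x) = x - \frac{n f(x)}{f'(x) \pm \sqrt{H(x)}}$$

$$F'(\alpha) = 1 - \frac{n f'(\alpha)}{f'(\alpha) \pm \sqrt{H(\alpha)}} = 1 - \frac{n f'(\alpha)}{f'(\alpha) \pm (n - 1)|f'(\alpha)|} \tag{8.10-60}$$

$$F''(\alpha) = -\frac{n f''(\alpha)}{f'(\alpha) \pm (n - 1)|f'(\alpha)|} \left[ 1 - \frac{2 f'(\alpha)}{f'(\alpha) \pm (n - 1)|f'(\alpha)|} \left( 1 \pm \frac{n - 2}{2} \, \frac{f'(\alpha)}{|f'(\alpha)|} \right) \right]$$
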


From (8.10-60) it is easy to see that if the sign is chosen to agree with the sign
of $f'(\alpha)$, then both $F'(\alpha)$ and $F''(\alpha)$ are zero. Therefore, in practice we should
choose the sign in (8.10-59) to agree with $f'(x_i)$. In particular we can then
show that if the initial approximation $x_1 < \alpha_1$, then $x_i < x_{i+1} < \alpha_1$, and,
similarly, if $\alpha_n < x_1$, then $\alpha_n < x_{i+1} < x_i$ {60}. Since it can be shown that
$F'''(\alpha) \ne 0$, Laguerre’s method is a third-order method for simple real zeros.
This is obtained at the expense of calculating f(x), f'(x), and f''(x) at every
stage of the iteration.

The great advantage of Laguerre’s method is that the method is sure to
converge independent of the initial approximation $x_1$. From our construction,
this is obvious if $x_1 \in [\alpha_1, \alpha_n]$, and from the comments above, it is
also true if $x_1$ is outside this interval. This method is then a powerful, rapidly
converging method for a polynomial all of whose zeros are real and simple.
If all the zeros are real but some are not simple, the method still converges
but is first order in the neighborhood of a multiple zero. For polynomials
some of whose zeros are complex, little is known about the overall convergence
properties of the method. Note that a real initial approximation may
nevertheless converge to a complex zero, since H(x) can be negative in this
case. It is known, however, that when the method converges to a simple
complex zero, the convergence is third order [Parlett (1964)]. Empirical
evidence suggests that lack of convergence is extremely unusual. Laguerre’s
method is therefore a candidate for use as a general-purpose method for
finding the zeros of polynomials. Its infrequent use for this purpose is due
partially, at least, to an incomplete theoretical understanding of the
method.


Example 8.14 Use Laguerre’s method to find the largest positive zero of the polynomial
of Example 8.12.
To illustrate the power of this method, we choose a very bad initial approximation
$x_1 = 10^6$. We have

$$f(x) = x^3 - 5x^2 - 17x + 21$$
$$f'(x) = 3x^2 - 10x - 17$$
$$f''(x) = 6x - 10$$

Then using (8.10-58) and (8.10-59), we calculate

    i       2            3            4
    x_i     7.4785207    7.0001011    7.0000000

and the true zero is 7. We note that if the magnitude of $x_1$ is large, the arithmetic at the first
iteration must be done to high precision or substantial loss of significance results.
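
The iteration (8.10-59), with the sign chosen to agree with $f'(x_i)$, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a production routine: the names `horner_derivs`, `laguerre_step`, and `laguerre` are ours, the polynomial is assumed to have only real zeros (so that H is nonnegative up to rounding), and the starting value $10^6$ is taken as the bad initial approximation of Example 8.14. In ordinary double precision the first iterate is reproduced to only about four figures, in keeping with the remark above on loss of significance.

```python
import math


def horner_derivs(coeffs, x):
    """Evaluate f(x), f'(x), f''(x); coeffs are given in descending powers."""
    f = fp = fpp = 0.0
    for c in coeffs:
        fpp = fpp * x + 2.0 * fp  # uses the previous f'
        fp = fp * x + f           # uses the previous f
        f = f * x + c
    return f, fp, fpp


def laguerre_step(coeffs, x):
    """One step of (8.10-59); the sign of the square root follows f'(x)."""
    n = len(coeffs) - 1
    f, fp, fpp = horner_derivs(coeffs, x)
    if f == 0.0:
        return x  # x is already a zero
    h = (n - 1) * ((n - 1) * fp * fp - n * f * fpp)  # H(x) of (8.10-58)
    root = math.sqrt(max(h, 0.0))  # H >= 0 when all zeros are real
    return x - n * f / (fp + math.copysign(root, fp))


def laguerre(coeffs, x, steps=10):
    """Return the list of iterates x_1, x_2, ... from repeated Laguerre steps."""
    iterates = [x]
    for _ in range(steps):
        x = laguerre_step(coeffs, x)
        iterates.append(x)
    return iterates


# Polynomial of Example 8.14: x^3 - 5x^2 - 17x + 21 = (x - 7)(x - 1)(x + 3)
xs = laguerre([1.0, -5.0, -17.0, 21.0], 1.0e6, steps=6)
print(xs[1], xs[-1])
```

The sketch does not guard against a vanishing denominator (f' and H both zero), which a careful implementation would have to handle.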


8.11 THE JENKINS-TRAUB METHOD 


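Consider the monic polynomial

$$P(z) = z^n + a_{n-1} z^{n-1} + \cdots + a_1 z + a_0 = \prod_{j=1}^{p} (z - \alpha_j)^{m_j} \tag{8.11-1}$$

where the $a_i$ are complex numbers with $a_0 \ne 0$ and where the zero $\alpha_j$ of P(z)
has multiplicity $m_j$, j = 1, ..., p, so that $\sum_{j=1}^{p} m_j = n$. We then have that

$$P'(z) = \sum_{j=1}^{p} m_j P_j(z) \tag{8.11-2}$$

where

$$P_j(z) = \frac{P(z)}{z - \alpha_j}, \qquad j = 1, \ldots, p \tag{8.11-3}$$

We shall now generate a sequence of polynomials $H^{(\lambda)}(z)$ starting with
$H^{(0)}(z) = P'(z)$, each of the form

$$H^{(\lambda)}(z) = \sum_{j=1}^{p} c_j^{(\lambda)} P_j(z) \tag{8.11-4}$$

with $c_j^{(0)} = m_j$, j = 1, ..., p. If we can choose such a sequence so that
$H^{(\lambda)}(z) \to c_1^{(\lambda)} P_1(z)$, that is, so that the ratios

$$d_j^{(\lambda)} = \frac{c_j^{(\lambda)}}{c_1^{(\lambda)}} \to 0, \qquad j = 2, \ldots, p \tag{8.11-5}$$
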




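then we can find a sequence $\{t_\lambda\}$ of approximations to $\alpha_1$ by the formula

$$t_{\lambda+1} = s_\lambda - \frac{P(s_\lambda)}{\bar{H}^{(\lambda+1)}(s_\lambda)} \tag{8.11-6}$$

where

$$\bar{H}^{(\lambda)}(z) = \frac{H^{(\lambda)}(z)}{\sum_{j=1}^{p} c_j^{(\lambda)}} \tag{8.11-7}$$

is monic and $\{s_\lambda\}$ is an arbitrary sequence of complex numbers. This is so
because, using (8.11-3) to (8.11-5),

$$t_{\lambda+1} = s_\lambda - \frac{P(s_\lambda) \sum_{j=1}^{p} c_j^{(\lambda+1)}}{\sum_{j=1}^{p} c_j^{(\lambda+1)} P_j(s_\lambda)}
= s_\lambda - \frac{P_1(s_\lambda)(s_\lambda - \alpha_1) \left( 1 + \sum_{j=2}^{p} d_j^{(\lambda+1)} \right)}{P_1(s_\lambda) \left[ 1 + \sum_{j=2}^{p} d_j^{(\lambda+1)} P_j(s_\lambda)/P_1(s_\lambda) \right]}$$

which approaches $s_\lambda - (s_\lambda - \alpha_1) = \alpha_1$.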
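The $H^{(\lambda)}(z)$ are generated by the formula

$$H^{(\lambda+1)}(z) = \frac{1}{z - s_\lambda} Q^{(\lambda)}(z) \tag{8.11-8}$$

where

$$Q^{(\lambda)}(z) = H^{(\lambda)}(z) - \frac{H^{(\lambda)}(s_\lambda)}{P(s_\lambda)} P(z) \tag{8.11-9}$$

We can generate such a sequence so long as $P(s_\lambda) \ne 0$. Otherwise, of course,
$s_\lambda$ is already a zero of P(z) and we can deflate P(z) and start afresh. Now
using (8.11-3), (8.11-4), and (8.11-9) in (8.11-8), we find that

$$H^{(\lambda+1)}(z) = \frac{P(z)}{z - s_\lambda} \left( \sum_{j=1}^{p} \frac{c_j^{(\lambda)}}{z - \alpha_j} - \sum_{j=1}^{p} \frac{c_j^{(\lambda)}}{s_\lambda - \alpha_j} \right) = \sum_{j=1}^{p} c_j^{(\lambda+1)} P_j(z) \tag{8.11-10}$$

where

$$c_j^{(\lambda+1)} = \frac{c_j^{(\lambda)}}{\alpha_j - s_\lambda}, \qquad j = 1, \ldots, p \tag{8.11-11}$$

Hence

$$c_j^{(\lambda+1)} = \frac{c_j^{(\lambda)}}{\alpha_j - s_\lambda} = \frac{c_j^{(\lambda-1)}}{(\alpha_j - s_\lambda)(\alpha_j - s_{\lambda-1})} = \cdots = m_j \prod_{t=0}^{\lambda} (\alpha_j - s_t)^{-1}, \qquad j = 1, \ldots, p \tag{8.11-12}$$
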
and if no $s_\lambda$ is a zero of P(z), $c_j^{(\lambda)} \ne 0$ for all j.

We now discuss the choice of values of $s_\lambda$. Since, as we noted in
Sec. 8.10, deflation is most stable when the zeros are generated in order of
increasing magnitude, we are interested in converging to a zero of smallest
magnitude. Hence, in the first stage, we choose $s_\lambda = 0$ for $\lambda = 0, \ldots, M - 1$,
so that the coefficients of large zeros will be small while those corresponding
to small zeros will be accentuated. This happens with $s_\lambda = 0$ because with
such a choice

$$d_j^{(M)} = \frac{m_j}{m_1} \left( \frac{\alpha_1}{\alpha_j} \right)^M \tag{8.11-13}$$

Now, were there a zero $\alpha_1$ such that

$$|\alpha_1| < |\alpha_2| \le |\alpha_3| \le \cdots \le |\alpha_p| \tag{8.11-14}$$

we could continue with $s_\lambda = 0$ and the sequence $\{t_\lambda\}$ would converge to $\alpha_1$,
with the rate of convergence determined by the size of the ratio $|\alpha_1/\alpha_2|$.
However, (8.11-14) does not always hold, and even when it does hold,
$|\alpha_1/\alpha_2|$ may be very close to unity; i.e., we may have a cluster of zeros.
Therefore, we choose $s_\lambda = 0$ for a fixed number M of iterations and then
subsequently handle the root-cluster problem. In practice, M is taken to be 5
on the basis of numerical experience.
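
The no-shift stage is easy to try numerically. The sketch below applies the recurrence (8.11-8), (8.11-9) with $s_\lambda = 0$ and forms $t_{\lambda+1}$ from (8.11-6). The helper names and the test polynomial (z - 1)(z - 3)(z - 5) are our own choices, and no rescaling of the H polynomials is attempted, which a practical code would need. Since $|\alpha_1/\alpha_2| = 1/3$ here, the iterates should approach 1 linearly.

```python
def polyval(coeffs, z):
    """Horner evaluation; coeffs are given in descending powers."""
    result = 0.0
    for c in coeffs:
        result = result * z + c
    return result


def next_h(p, h, s):
    """H^(lambda+1) from (8.11-8), (8.11-9): divide Q = H - (H(s)/P(s)) P by (z - s)."""
    ratio = polyval(h, s) / polyval(p, s)
    # Q has degree n while H has degree n - 1, so H contributes no z^n term.
    q = [-ratio * p[0]] + [hc - ratio * pc for hc, pc in zip(h, p[1:])]
    out, carry = [], 0.0
    for qc in q[:-1]:  # synthetic division; the remainder Q(s) = 0 is dropped
        carry = carry * s + qc
        out.append(carry)
    return out


def t_next(p, h_new, s):
    """t_(lambda+1) from (8.11-6); dividing by the leading coefficient makes H monic."""
    return s - polyval(p, s) * h_new[0] / polyval(h_new, s)


p = [1.0, -9.0, 23.0, -15.0]             # (z - 1)(z - 3)(z - 5)
n = len(p) - 1
h = [(n - k) * p[k] for k in range(n)]   # H^(0) = P'
ts = []
for _ in range(40):
    h = next_h(p, h, 0.0)
    ts.append(t_next(p, h, 0.0))
print(ts[0], ts[4], ts[-1])
```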

In the second stage, we are interested in separating zeros of equal or
almost equal magnitude. To this end we iterate with $s_\lambda = s$, a complex
number which we hope will be closer to one zero than to any other. In the
sequel, we shall call this zero $\alpha_1$ even though it is not necessarily the zero
smallest in magnitude. However, its magnitude will not differ much from the
minimum, so that except in most unusual circumstances deflation will still
be stable. With such a choice of s, we find that

$$d_j^{(L)} = \frac{m_j}{m_1} \left( \frac{\alpha_1}{\alpha_j} \right)^M \left( \frac{\alpha_1 - s}{\alpha_j - s} \right)^{L-M}, \qquad j = 2, \ldots, p \tag{8.11-15}$$

where $\lambda = M, M + 1, \ldots, L - 1$. Hence, if $|\alpha_1 - s| < |\alpha_j - s|$, $d_j^{(L)} \to 0$ as
$L \to \infty$, j = 2, ..., p, so that $t_\lambda \to \alpha_1$.
As an initial estimate for s we choose

$$s = e^{i\theta} \beta \tag{8.11-16}$$

where $\theta$ is an angle chosen at random and $\beta$ is the unique positive zero of the
polynomial

$$z^n + |a_{n-1}| z^{n-1} + \cdots + |a_1| z - |a_0| \tag{8.11-17}$$

By a theorem of Cauchy [Householder (1970), p. 73, example 5] $\beta$ is a lower
bound on the $|\alpha_j|$, j = 1, ..., p. $\theta$ is theoretically taken at random since we
have no prior knowledge about the location of the zeros in the complex
plane and with probability 1 it will yield a value of s closer to one zero than
to any other zero. This way of choosing $\theta$ could be implemented in practice
with a random-number generator. However, the standard implementation
takes $\theta$ initially to be 49°, which gives a value close to the middle of the first
quadrant of the complex plane. With this value of s, we could usually iterate
to convergence. However, we would again have linear convergence with the
rate of convergence determined by the value of the ratio $|(\alpha_1 - s)/(\alpha_2 - s)|$,
where $\alpha_2$ is the zero next closest to s (cf. Fig. 8.6). Hence we iterate only
L - M times, until the corresponding $t_\lambda$ is sufficiently close to $\alpha_1$. In theory,
L depends on the distribution of the zeros of P(z), as we shall see in Theorem
8.6. In practice, L is determined when

$$|t_\lambda - t_{\lambda-1}| \le \tfrac12 |t_{\lambda-1}| \quad \text{and} \quad |t_{\lambda-1} - t_{\lambda-2}| \le \tfrac12 |t_{\lambda-2}| \tag{8.11-18}$$

It may happen that by some misfortune s is equidistant or almost so from
two or more zeros. In this case we switch to a new value of s if (8.11-18) does
not hold after some maximum number of iterations; i.e., we choose a new
value of $\theta$ in (8.11-16) either by generating a new random number or, in the
standard implementation, by increasing the previous value of $\theta$ by 94°,
which steps us through the four quadrants of the complex plane if necessary
and then repeats the cycle at a point 16° away from the previous point in the
same quadrant. We then restart with $\lambda = M$.
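
The choice of the fixed shift can be illustrated as follows. The sketch computes $\beta$, the positive zero of (8.11-17), by bisection (our choice of method; any safe root finder would do) and then forms s from (8.11-16) with the angles 49°, 143°, 237°, 331° described above. The function names and the sample polynomial are assumptions made for illustration only.

```python
import cmath
import math


def cauchy_lower_bound(p, tol=1e-14):
    """Unique positive zero of (8.11-17) for monic p (descending powers, a_0 != 0)."""
    mags = [abs(c) for c in p]

    def g(x):
        value = 0.0
        for m in mags[:-1]:
            value = value * x + m
        return value * x - mags[-1]

    lo, hi = 0.0, 1.0
    while g(hi) <= 0.0:  # g increases for x > 0, so a bracket is found by doubling
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if g(mid) <= 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def stage2_shift(beta, attempt=0):
    """s = beta * exp(i*theta), theta = 49 + 94*attempt degrees, as in (8.11-16)."""
    theta = math.radians(49.0 + 94.0 * attempt)
    return beta * cmath.exp(1j * theta)


p = [1.0, -9.0, 23.0, -15.0]  # (z - 1)(z - 3)(z - 5); smallest zero is 1
beta = cauchy_lower_bound(p)
shifts = [stage2_shift(beta, k) for k in range(4)]
print(beta, shifts)
```

For this polynomial $\beta$ comes out near 0.534, below the smallest zero magnitude 1, and the four shifts fall in the first, second, third, and fourth quadrants in turn.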

Figure 8.6 Ordering of the zeros in the Jenkins-Traub method.

Once (8.11-18) is satisfied, we enter the third stage and use a variable
shift, $s_\lambda = t_\lambda$, $\lambda = L, L + 1, \ldots$, since it is now assumed that $t_\lambda$ is a reasonably
good approximation to $\alpha_1$. As we shall shortly see, this gives us better than
quadratic convergence in the sense that the error coefficient also goes to
zero. Before proceeding with the proof of convergence, we shall summarize
the algorithm. Set

$$H^{(0)}(z) = P'(z)$$

$$s_\lambda = \begin{cases} 0 & \lambda = 0, \ldots, M - 1 \\ s \ \ \text{[from (8.11-16)]} & \lambda = M, \ldots, L - 1 \\ t_\lambda \ \ \text{[from (8.11-6)]} & \lambda = L, L + 1, \ldots \end{cases} \tag{8.11-19}$$

$$H^{(\lambda+1)}(z) = \frac{1}{z - s_\lambda} Q^{(\lambda)}(z), \qquad Q^{(\lambda)}(z) \ \text{from (8.11-9)}$$

where M is usually taken to be 5, L is determined by when (8.11-18) holds,
and convergence to $\alpha_1$ takes place when $|P(s_\lambda)| < \epsilon$, where $\epsilon$ is a measure of
the roundoff error incurred in computing $P(s_\lambda)$.
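
The summary (8.11-19) translates fairly directly into code. The following is a minimal sketch of the three stages for a monic polynomial, not the standard implementation: all names are ours, each H is rescaled to be monic (harmless, since the recurrence is linear in H), the stopping tolerance `eps` is a fixed guess rather than a computed roundoff bound, and no provision is made for H(s) = 0 or for deflation.

```python
import cmath
import math


def polyval(coeffs, z):
    result = 0j
    for c in coeffs:
        result = result * z + c
    return result


def next_h(p, h, s):
    """Monic-normalized H^(lambda+1) from (8.11-8), (8.11-9)."""
    ratio = polyval(h, s) / polyval(p, s)
    q = [-ratio * p[0]] + [hc - ratio * pc for hc, pc in zip(h, p[1:])]
    out, carry = [], 0j
    for qc in q[:-1]:  # synthetic division by (z - s)
        carry = carry * s + qc
        out.append(carry)
    return [c / out[0] for c in out]


def cauchy_bound(p):
    """Positive zero of (8.11-17), by bisection (p monic, a_0 != 0)."""
    mags = [abs(c) for c in p]

    def g(x):
        v = 0.0
        for m in mags[:-1]:
            v = v * x + m
        return v * x - mags[-1]

    lo, hi = 0.0, 1.0
    while g(hi) <= 0.0:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) <= 0.0 else (lo, mid)
    return 0.5 * (lo + hi)


def smallest_zero(p, m_steps=5, max_fixed=20, max_variable=30, eps=1e-12):
    n = len(p) - 1
    h = [(n - k) * p[k] / n for k in range(n)]  # H^(0) = P', scaled to be monic
    for _ in range(m_steps):                    # stage 1: s = 0
        h = next_h(p, h, 0j)
    beta = cauchy_bound(p)
    for attempt in range(8):                    # stage 2: fixed shift, rotated on failure
        s = beta * cmath.exp(1j * math.radians(49.0 + 94.0 * attempt))
        hs, ts = h, []
        for _ in range(max_fixed):
            hs = next_h(p, hs, s)
            ts.append(s - polyval(p, s) / polyval(hs, s))
            if (len(ts) >= 3
                    and abs(ts[-1] - ts[-2]) <= 0.5 * abs(ts[-2])
                    and abs(ts[-2] - ts[-3]) <= 0.5 * abs(ts[-3])):
                break                           # test (8.11-18) holds
        else:
            continue                            # (8.11-18) never held: try a new angle
        s = ts[-1]                              # stage 3: variable shift s = t
        for _ in range(max_variable):
            if abs(polyval(p, s)) <= eps:
                return s
            hs = next_h(p, hs, s)
            s = s - polyval(p, s) / polyval(hs, s)
    raise RuntimeError("no convergence in this sketch")


root = smallest_zero([1.0, -9.0, 23.0, -15.0])  # (z - 1)(z - 3)(z - 5)
print(root)
```

For this well-separated example the hypotheses of the convergence theorem that follows are already met when stage 3 begins, so only a few variable-shift steps are needed.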

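We now state the theorem which ensures convergence if $s_L = t_L$ is
sufficiently close to $\alpha_1$.


Theorem 8.4 If

1. $|s_L - \alpha_1| < \tfrac12 R$, where $R = \min_{2 \le j \le p} |\alpha_1 - \alpha_j|$

2. $D_L = \sum_{j=2}^{p} |d_j^{(L)}| < \tfrac13$

then the $s_\lambda$ in the third-stage iteration ($= t_\lambda$) converge to $\alpha_1$. Furthermore,
if we define

$$C(\lambda) = \frac{|s_{L+\lambda+1} - \alpha_1|}{|s_{L+\lambda} - \alpha_1|^2} \tag{8.11-20}$$

then

$$C(\lambda) \le \frac{2}{R} \, \tau_L^{\lambda(\lambda-1)/2} \tag{8.11-21}$$

where

$$\tau_L = \frac{2 D_L}{1 - D_L} \tag{8.11-22}$$

Thus, the process is second order with an error constant $C(\lambda)$ which
approaches zero.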


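PROOF Let us define

$$r_j^{(\lambda)} = \frac{s_\lambda - \alpha_1}{s_\lambda - \alpha_j}, \qquad j = 1, \ldots, p \tag{8.11-23}$$

so that

$$d_j^{(\lambda+1)} = r_j^{(\lambda)} d_j^{(\lambda)} \tag{8.11-24}$$

Since $s_\lambda = t_\lambda$ in stage 3, we have from (8.11-6) that

$$s_{\lambda+1} = s_\lambda - \frac{P(s_\lambda)}{\bar{H}^{(\lambda+1)}(s_\lambda)} \tag{8.11-25}$$

so that

$$\frac{s_{\lambda+1} - \alpha_1}{s_\lambda - \alpha_1} = 1 - \frac{P(s_\lambda)}{(s_\lambda - \alpha_1) \bar{H}^{(\lambda+1)}(s_\lambda)} \tag{8.11-26}$$
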

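Now using (8.11-3), (8.11-4), and (8.11-7),

$$(s_\lambda - \alpha_1) \bar{H}^{(\lambda+1)}(s_\lambda) = \frac{(s_\lambda - \alpha_1) \sum_{j=1}^{p} c_j^{(\lambda+1)} P_j(s_\lambda)}{\sum_{j=1}^{p} c_j^{(\lambda+1)}}$$

By (8.11-3) and (8.11-23), $(s_\lambda - \alpha_1) P_j(s_\lambda) = P(s_\lambda) r_j^{(\lambda)}$, and by (8.11-5) and
(8.11-24), $c_j^{(\lambda+1)}/c_1^{(\lambda+1)} = r_j^{(\lambda)} d_j^{(\lambda)}$. Substituting this into (8.11-26) and using
the fact that $r_1^{(\lambda)} = d_1^{(\lambda)} = 1$, we obtain

$$\frac{s_{\lambda+1} - \alpha_1}{s_\lambda - \alpha_1} = \frac{\sum_{j=2}^{p} (r_j^{(\lambda)})^2 d_j^{(\lambda)} - \sum_{j=2}^{p} r_j^{(\lambda)} d_j^{(\lambda)}}{1 + \sum_{j=2}^{p} (r_j^{(\lambda)})^2 d_j^{(\lambda)}} \tag{8.11-27}$$
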
By the first hypothesis, $|s_L - \alpha_j| \ge |\alpha_1 - \alpha_j| - |s_L - \alpha_1| > R/2$, so
that $|r_j^{(L)}| < 1$, j = 2, ..., p. If we now define

$$T_\lambda = \frac{|s_{\lambda+1} - \alpha_1|}{|s_\lambda - \alpha_1|} \tag{8.11-28}$$

it follows that

$$T_L \le \frac{\sum_{j=2}^{p} |d_j^{(L)}| + \sum_{j=2}^{p} |d_j^{(L)}|}{1 - \sum_{j=2}^{p} |d_j^{(L)}|} = \tau_L < 1 \tag{8.11-29}$$



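Assume now that $T_t \le \tau_L$ for $t = L, \ldots, \lambda - 1$. Then since $|s_\lambda - \alpha_1| =
T_{\lambda-1} |s_{\lambda-1} - \alpha_1|$, it follows that

$$|s_\lambda - \alpha_1| \le \tau_L^{\lambda-L} |s_L - \alpha_1| < \tfrac12 R \tag{8.11-30}$$

and

$$|s_\lambda - \alpha_j| \ge |\alpha_1 - \alpha_j| - |s_\lambda - \alpha_1| > \tfrac12 R, \qquad j = 2, 3, \ldots, p \tag{8.11-31}$$
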


Hence $|r_j^{(\lambda)}| < 1$, so that

$$\sum_{j=2}^{p} |d_j^{(\lambda)}| = \sum_{j=2}^{p} |d_j^{(L)}| \prod_{t=L}^{\lambda-1} |r_j^{(t)}| \le D_L$$



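Therefore, by induction, $T_\lambda \le \tau_L < 1$ for $\lambda \ge L$, and since $|s_\lambda - \alpha_1| =
|s_L - \alpha_1| \prod_{t=L}^{\lambda-1} T_t$, this proves convergence. Now, using (8.11-23),
(8.11-24), and (8.11-27) with $\lambda$ replaced by $\lambda + L$, we have that

$$\frac{s_{L+\lambda+1} - \alpha_1}{(s_{L+\lambda} - \alpha_1)^2} = \frac{\sum_{j=2}^{p} \dfrac{r_j^{(L+\lambda)} d_j^{(L+\lambda)}}{s_{L+\lambda} - \alpha_j} - \sum_{j=2}^{p} \dfrac{d_j^{(L+\lambda)}}{s_{L+\lambda} - \alpha_j}}{1 + \sum_{j=2}^{p} (r_j^{(L+\lambda)})^2 d_j^{(L+\lambda)}} \tag{8.11-32}$$
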

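From (8.11-30) and (8.11-31) we have for all $\lambda$ and j > 1 {61}

$$|r_j^{(L+\lambda)}| \le \tau_L^{\lambda}, \qquad \frac{1}{|s_{L+\lambda} - \alpha_j|} < \frac{2}{R} \tag{8.11-33}$$

From this and (8.11-24) it follows that

$$\sum_{j=2}^{p} |d_j^{(L+\lambda)}| = \sum_{j=2}^{p} |d_j^{(L)}| \prod_{t=0}^{\lambda-1} |r_j^{(L+t)}| \le \tau_L^{\lambda(\lambda-1)/2} \sum_{j=2}^{p} |d_j^{(L)}| < \tfrac13 \tau_L^{\lambda(\lambda-1)/2} \tag{8.11-34}$$

Substituting these bounds into the numerator of (8.11-32) and using
hypothesis 2 for the denominator establishes (8.11-21) {61}.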

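There is still one loose end in this proof, namely to show that the
sequence $\{s_\lambda\}$ is always well defined for $\lambda \ge L$, that is, that $\bar{H}^{(\lambda+1)}(s_\lambda) \ne 0$.
Now using (8.11-3) to (8.11-5), (8.11-7), and (8.11-11) gives {61}

$$\bar{H}^{(\lambda+1)}(s_\lambda) = \frac{\sum_{j=1}^{p} c_j^{(\lambda)} (\alpha_j - s_\lambda)^{-1} P_j(s_\lambda)}{\sum_{j=1}^{p} c_j^{(\lambda)} (\alpha_j - s_\lambda)^{-1}} = P_1(s_\lambda) \, \frac{1 + \sum_{j=2}^{p} d_j^{(\lambda)} (r_j^{(\lambda)})^2}{1 + \sum_{j=2}^{p} d_j^{(\lambda)} r_j^{(\lambda)}} \tag{8.11-35}$$

Since we assumed that $s_\lambda$ is not a zero of P(z), $P_1(s_\lambda) =
P(s_\lambda)/(s_\lambda - \alpha_1) \ne 0$. Furthermore, as above,

$$\left| \sum_{j=2}^{p} d_j^{(\lambda)} (r_j^{(\lambda)})^2 \right| \le D_L < \tfrac13 \tag{8.11-36}$$

Hence $\bar{H}^{(\lambda+1)}(s_\lambda) \ne 0$, and the third-stage iteration is well defined so
long as $s_\lambda$ is not a zero of P(z). This concludes the proof of Theorem 8.4.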


It remains to show the existence of a theoretical L for which the hypotheses
of Theorem 8.4 are satisfied.


Theorem 8.5 Let s in (8.11-16) be such that $|s - \alpha_1| < |s - \alpha_j|$, j = 2,
..., p. Then we can find an L such that hypotheses 1 and 2 of Theorem
8.4 hold.


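PROOF From (8.11-15),

$$D_L = \sum_{j=2}^{p} |d_j^{(L)}| = \sum_{j=2}^{p} \frac{m_j}{m_1} \left| \frac{\alpha_1}{\alpha_j} \right|^M \left| \frac{\alpha_1 - s}{\alpha_j - s} \right|^{L-M} \tag{8.11-37}$$

If we fix M, then since the last term is less than 1, we can choose L
sufficiently large to ensure that $D_L < \tfrac13$ and that $\tau_L$ is sufficiently small
for $|s - \alpha_1| \tau_L < \tfrac12 R$. As in (8.11-30),

$$|s_L - \alpha_1| \le \tau_L |s - \alpha_1| < \tfrac12 R \tag{8.11-38}$$

which completes the proof.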


There is an interesting connection between the Jenkins-Traub method
and Newton’s method, in that the formula (8.11-25) is identical with a single
application of Newton’s method to the function $W^{(\lambda)}(z) = P(z)/H^{(\lambda)}(z)$ {62}.
Thus, we can interpret the third-stage iteration as the application of
Newton’s method to a sequence of rational functions $W^{(\lambda)}(z)$, which for $\lambda$
sufficiently large are as close as desired to a linear polynomial with zero $\alpha_1$.

The Jenkins-Traub method has been generalized to find quadratic factors
of a real polynomial. This enables one to find the conjugate-complex
zeros of a real polynomial using only real arithmetic. This makes for a faster
algorithm at the cost of greater complexity.
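
Returning to the connection with Newton’s method, the identity can be checked in a few lines. The sketch below assumes $H^{(\lambda)}(s_\lambda) \ne 0$ and $P(s_\lambda) \ne 0$, and drops the superscript $(\lambda)$ on H and W on the right-hand sides for brevity.

```latex
% Newton's method applied to W(z) = P(z)/H(z) at z = s_\lambda:
s_\lambda - \frac{W}{W'}\bigg|_{z = s_\lambda}
  = s_\lambda - \frac{P H}{P' H - P H'}\bigg|_{z = s_\lambda} .
% Since Q^{(\lambda)}(s_\lambda) = 0, (8.11-8) and (8.11-9) give
%   H^{(\lambda+1)}(s_\lambda) = Q^{(\lambda)\prime}(s_\lambda) = H' - (H/P) P'  at z = s_\lambda,
% while the leading coefficient of H^{(\lambda+1)}, which equals \sum_j c_j^{(\lambda+1)},
% is -H(s_\lambda)/P(s_\lambda). Hence, by (8.11-7),
\bar{H}^{(\lambda+1)}(s_\lambda)
  = \frac{H' - (H/P) P'}{-H/P}\bigg|_{z = s_\lambda}
  = \frac{P' H - P H'}{H}\bigg|_{z = s_\lambda} ,
% so that P/\bar{H}^{(\lambda+1)} = P H/(P' H - P H') = W/W' at z = s_\lambda,
% and (8.11-25) is exactly the Newton step for W^{(\lambda)}.
```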


Example 8.15 Find the zeros of the polynomial

$$\begin{aligned} P(z) &= z^5 - (13.999 + 5i) z^4 + (74.99 + 55.998i) z^3 \\ &\quad - (159.959 + 260.982i) z^2 + (1.95 + 463.934i) z + (150 - 199.95i) \\ &= (z - 1 - i)^2 (z - 4 + 3i)(z - 4 - 3i)(z - 3.999 - 3i) \end{aligned}$$

using the Jenkins-Traub method.

This polynomial has a zero of multiplicity 2 plus three almost equimodular zeros,
two of which form a near-multiple pair. In the calculation of the zeros, M in stage 1 was set
equal to 5. In the table we give the value of s used in stage 2, the number of steps in stage 2,
L - M, the value of $s_L$ used to start stage 3, and the values $s_{L+j}$, j = 1, 2, ..., until the
stopping criterion was satisfied.


    j    s_{L+j}

Zero(1):  s = .27281 + .30819i        L - M = 2    s_L = .99998 + 1.00003i
    1    .99999 99999 97 + .99999 99999 74i

Zero(2):  s = -.528885 + .41362i      L - M = 2    s_L = 1.00006 + 1.00034i
    1    .99999 99999 97 + 1.00000 00000 33i
    2    1.00000 00000 04 + 1.00000 00000 26i

Zero(3):  s = -.85450 - 1.24126i      L - M = 6    s_L = 5.78433 - 1.78155i
    1    4.15089 03638 40 - 3.60756 77541 67i
    2    3.98898 38448 95 - 3.00489 27129 73i
    3    4.00000 02599 31 - 2.99999 95584 78i
    4    4.00000 00000 00 - 3.00000 00000 00i

Zero(4):  s = 1.76985 - 1.06591i      L - M = 2    s_L = 3.99950 + 3.00000i
    1    3.99949 91925 97 + 3.00000 08282 43i
    2    3.99959 49473 61 + 3.00016 17681 68i
    3    4.00016 64468 29 + 3.00043 18443 90i
    4    4.00007 84238 10 + 3.00000 41370 51i
    5    4.00000 07928 07 + 2.99999 93289 69i
    6    3.99999 99999 47 + 2.99999 99999 88i

Zero(5):  3.99900 00000 03 + 3.00000 00000 12i


8.12 A NEWTON-BASED METHOD


The aim of this method is to find the zero of smallest magnitude of a given
polynomial so that it can deflate in a stable fashion and thus find all the
zeros. Since it is based on Newton’s method, the main problem is to find an
approximation to this zero close enough to ensure convergence of Newton’s
method. This problem itself is solved with the help of the Newton formula in
that this formula is used to provide a direction of search for the next iterate
rather than its value. This direction is usually a descent direction for the real
function of a complex argument F(z) = |f(z)|, so that we obtain a sequence
of points giving decreasing function magnitudes. This is called stage 1. Since
searching along any particular direction is expensive with respect to function
evaluations, we would like to leave stage 1 as soon as possible. Hence, when
we determine that we are close enough to a zero of f(z) to ensure that
Newton’s method converges, we enter stage 2 and use the standard Newton
formula. The details for stage 1 are as follows for the complex case. The
modifications needed in the real case are only in the deflation and will be
given later.

Given the polynomial f(z) of (8.9-1), we wish to generate a sequence $\{z_k\}$
which converges to the zero of smallest magnitude or, at least, to one close to
it in magnitude. Successive points are related by the formula

$$z_{k+1} = z_k + \beta \, dz_k \tag{8.12-1}$$

where $dz_k$ is called the tentative step at iteration k and $\beta$ is a scalar, possibly
0. We distinguish between a successful step, in which $z_{k+1} \ne z_k$, and an
unsuccessful step, in which $z_{k+1} = z_k$. If the previous iteration was successful,
the Newton correction

$$n_k = -\frac{f(z_k)}{f'(z_k)} \tag{8.12-2}$$

is computed and $dz_k$ is taken as

$$dz_k = \begin{cases} n_k & \text{if } |n_k| \le 3|z_k - z_{k-1}| \\ 3|z_k - z_{k-1}| \, e^{i\theta} \, \dfrac{n_k}{|n_k|} & \text{otherwise} \end{cases} \tag{8.12-3}$$

where $\theta$ is chosen (arbitrarily) as $\tan^{-1} \tfrac34$. If the previous step was unsuccessful,
then

$$dz_k = -\tfrac12 e^{i\theta} \, dz_{k-1} \tag{8.12-4}$$

The motivation behind these choices of $dz_k$ is as follows. After a successful
iteration, we normally want to take a step in the Newton direction, since this
is usually a descent direction for F(z). However, in the neighborhood of a
saddle point of F(z), which occurs at a zero of f'(z), the Newton direction is a
worse direction than almost any other. Hence, if we see that $|n_k|$ is relatively
large, we suspect that we are approaching such a point and therefore change
the search direction by the amount $\theta$, which seems to work well in practice.
After an unsuccessful iteration, we want to change the search direction to
one likely to be successful. Now, if $f(z_k) \ne 0$, there certainly exists a descent
direction for F(z) inasmuch as F(z) has local minima (which are also global)
only at the zeros of f(z). Hence, since we search in a different direction each
time, repeated use of (8.12-4) will certainly yield a descent direction. Furthermore,
for reasons connected with the termination strategy, it is desirable to
reduce the step size, as we shall see.
Once $dz_k$ has been chosen, $f(z_k + dz_k)$ is computed and the inequality

$$F(z_k + dz_k) < F(z_k) \tag{8.12-5}$$

is tested. If it holds, the numbers $F(z_k + p \, dz_k)$, p = 2, ..., n, are computed
until the sequence is no longer strictly decreasing. Since this search is
designed to locate multiple zeros, we stop at p = n. Otherwise, we could end
up searching indefinitely along a particular direction with only marginal
improvement each time. If (8.12-5) does not hold, the numbers $F(z_k + \tfrac12 dz_k)$,
$F(z_k + \tfrac14 dz_k)$, and $F(z_k + \tfrac14 e^{i\theta} dz_k)$ are computed until the sequence ceases to
decrease. The reasoning here is that if the values of F(z) decrease as we
approach the point $z_k$, then this point may be near a saddle point and we
hope to do better by switching to another direction. In either case, $\beta$ is
chosen as the last value for which $F(z_k + \beta \, dz_k)$ is strictly less than the
previous one in the sequence. If $F(z_k + dz_k) \ge F(z_k)$ and $F(z_k + \tfrac12 dz_k) \ge
F(z_k + dz_k)$, we take $\beta = 0$, that is, $z_{k+1} = z_k$, and we have an unsuccessful
iteration. Note that if there is a true multiple zero of multiplicity m, then by
(8.6-13), $z_k + m n_k$ is the proper Newton step, which we shall usually find if
we are close enough to that zero. A similar situation will hold if we are at a
fair distance from a cluster of m zeros but nearer to it than to any other zero.
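
The rules for the tentative step and for the choice of $\beta$ can be sketched as two small functions. Several points here are assumptions rather than statements about the published program: the names are ours, $\theta$ is taken as $\tan^{-1}(3/4)$ and the short-step factors as 1/2, 1/4, and $\tfrac14 e^{i\theta}$, and, as an extra safeguard of our own, a shortened step is accepted only if it also brings F below $F(z_k)$.

```python
import cmath
import math

THETA = math.atan(0.75)            # assumed value of theta
ROTATION = cmath.exp(1j * THETA)   # e^{i*theta}, approximately 0.8 + 0.6i


def tentative_step(newton, z, z_prev, dz_prev, prev_successful):
    """Tentative step dz_k following (8.12-3) and (8.12-4)."""
    if not prev_successful:
        return -0.5 * ROTATION * dz_prev
    limit = 3.0 * abs(z - z_prev)
    if abs(newton) <= limit:
        return newton
    return limit * ROTATION * newton / abs(newton)


def choose_beta(F, z, dz, n):
    """Pick beta for z_{k+1} = z_k + beta*dz_k; 0 signals an unsuccessful step."""
    f0 = F(z)
    best_beta, best_val = 0.0, F(z + dz)
    if best_val < f0:                  # (8.12-5) holds: try longer steps p = 2, ..., n
        best_beta = 1.0
        for p in range(2, n + 1):
            val = F(z + p * dz)
            if val >= best_val:
                break
            best_beta, best_val = float(p), val
        return best_beta
    prev = best_val                    # (8.12-5) fails: shorter, then rotated, steps
    for beta in (0.5, 0.25, 0.25 * ROTATION):
        val = F(z + beta * dz)
        if val >= prev:
            break
        prev = val
        if val < f0:                   # extra safeguard (our assumption)
            best_beta = beta
    return best_beta


# F(z) = |z| has its only zero at the origin.
print(choose_beta(abs, 1.0, -0.3, 5), choose_beta(abs, 1.0, -3.0, 5))
```

With F(z) = |z| and $z_k = 1$, a step of -0.3 is extended to three times its length, while an overlong step of -3 is cut back to one quarter.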


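Stage 1 is initialized by the following values:

$$z_0 = 0$$

$$dz_0 = \begin{cases} -f(0)/f'(0) & \text{if } f'(0) \ne 0 \\ 1 & \text{otherwise} \end{cases} \tag{8.12-6}$$

$$z_1 = \tfrac12 \min_{k > 0} \left( \left| \frac{a_0}{a_k} \right|^{1/k} \right) \frac{dz_0}{|dz_0|}$$
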

The choice of $z_1$ is such that its magnitude is less than that of any zero of f(z)
[cf. Householder (1970), p. 73, example 11], and it is in the direction of
steepest descent of |f(z)| from the origin {63}. It is therefore likely that we
shall converge to a zero of near-minimal magnitude. We wish to switch to
stage 2 (in which we shall be using straight Newton iteration) when we are
reasonably sure of convergence of the Newton iteration, i.e., we are not
converging to a saddle point, and when we are not converging to a multiple
zero (for then the Newton iteration gives us no advantage). These conditions
require that we switch to stage 2 only when $\beta = 1$. The test for the former
condition is based on the Kantorovich theorem [Ostrowski (1973), p. 59],
which states that if $K_0$ is the circle with center $z_k + n_k$ and radius $|n_k|$, the
conditions

$$f(z_k) f'(z_k) \ne 0, \qquad 2 |f(z_k)| \max_{z \in K_0} |f''(z)| \le |f'(z_k)|^2 \tag{8.12-7}$$


ensure the convergence of Newton’s method starting from $z_k$. In practice
(8.12-7) is replaced by the test of the inequality

$$2 |f(z_k)| \, |f'(z_{k-1}) - f'(z_k)| \le |f'(z_k)|^2 \, |z_{k-1} - z_k| \tag{8.12-8}$$

which uses the crude approximation $|[f'(z_{k-1}) - f'(z_k)]/(z_{k-1} - z_k)|$ to
$\max_{z \in K_0} |f''(z)|$. When (8.12-8) holds, we switch over to stage 2; in stage 2
(8.12-8) is checked at every iteration, and if it does not hold, we switch back
to stage 1. Similarly, if (8.12-5) does not hold at any iteration, we switch back
to stage 1 with a tentative step given by (8.12-3). These tests are necessary
here since we used (8.12-8) rather than (8.12-7) to make our decision to enter
stage 2. Had we used (8.12-7), we would be certain that, barring roundoff
errors, (8.12-7) and (8.12-5) would hold at every stage 2 iteration. However,
because of roundoff, (8.12-5) need not hold in practice near a zero, as we
shall see, and even had we used (8.12-7), we would still have to check (8.12-5)
at every iteration.
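
The practical test (8.12-8) needs only quantities that are already available from two successive iterates. A minimal sketch follows; the function name and the sample function $f(z) = z^2 - 2$ are ours, chosen only to show one point where the test passes and one (near the zero of f' at the origin) where it fails.

```python
def newton_is_trusted(f_k, fp_k, fp_prev, z_k, z_prev):
    """Test (8.12-8): a difference quotient of f' stands in for max |f''| on K_0."""
    lhs = 2.0 * abs(f_k) * abs(fp_prev - fp_k)
    rhs = abs(fp_k) ** 2 * abs(z_prev - z_k)
    return lhs <= rhs


def f(z):
    return z * z - 2.0


def fp(z):
    return 2.0 * z


near_zero = newton_is_trusted(f(1.5), fp(1.5), fp(1.4), 1.5, 1.4)
near_saddle = newton_is_trusted(f(0.1), fp(0.1), fp(0.2), 0.1, 0.2)
print(near_zero, near_saddle)
```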

We can terminate the process with an approximation to a zero while in
either stage 1 or stage 2. For termination, one of two criteria must be
satisfied. Both involve $\epsilon$, the largest number such that, to machine accuracy,
$1 + \epsilon = 1$. These criteria are

$$\epsilon |z_{k+1}| > |z_{k+1} - z_k| > 0 \tag{8.12-9}$$

and

$$F(z_{k+1}) = F(z_k) < 16 n |a_0| \epsilon = \delta \tag{8.12-10}$$

The number $\delta$ is a generous overestimate of the roundoff error made in
calculating f(z) at the zero of smallest modulus; it is to be expected that such
accuracy will be attainable. The normal convergence pattern is that $F(z_k)$
decreases until well below $\delta$, and then roundoff errors cause a new iterate
$z_{k+1}$ to be equal to $z_k$, so that (8.12-10) holds. If such accuracy is unattainable
and $F(z_k)$ does not decrease at each stage, we shall switch back to stage
1. Then the step will decrease steadily because of (8.12-4) until (8.12-9) is
satisfied. With this combination of convergence criteria, we are almost certain
to obtain the best possible solution, and usually at the cost of only one
extra iteration.

If we have a real polynomial and have found a complex zero
$\alpha_j = x_j + i y_j$, we must decide whether this is a true complex zero, in which
case we deflate by the real quadratic factor $(z - \alpha_j)(z - \bar{\alpha}_j)$ using (8.10-9), or
whether this is a simple real zero $x_j$. To this end, we compute $f(x_j)$ and
check whether the difference between the computed values of $|f(x_j)|$ and
$|f(\alpha_j)|$ is within the roundoff error in the computation of $f(x_j)$. If so, we
deflate with the linear factor $z - x_j$ using (8.10-1).

A program based on this method has been compared with one based on 
the Jenkins-Traub method and was found to be from 2 to 4 times as fast and 
at least as accurate. An additional feature incorporated in this program is that 
it computes an error bound for each zero separately. This error bound is the 
radius of the circle with center at the calculated zero in which the true zero of 
f(z) is almost certain to lie. The program and details are given in the refer- 
ence cited in the Bibliographic Notes. 


Example 8.16 Find the zeros of the polynomial given in Example 8.15 using the method of 
this section. 

The zeros $\alpha_j$ were computed in the order given below. With each zero, we give m, the
number of evaluations of f(z), and n, the number of evaluations of f'(z). We give the results
to 12 figures. In this computation ¢ = 2.3 x 107 °°. 


    j    α_j                                       m     n

    1    .99999 99920 93 + 1.00000 00008 2i       25     7
    2    1.00000 00079 1 + .99999 99991 77i       18     8
    3    3.99899 99999 9 + 3.00000 00000 0i       48    14
    4    4.00000 00000 0 - 3.00000 00000 0i       21     9
    5    4.00000 00000 1 + 3.00000 00000 0i

If we compare these numbers with the results in Example 8.15, we see that the 
accuracy of the Jenkins-Traub method was slightly better for the multiple zeros and 
marginally worse for the others. On the surface, it would appear that the Jenkins-Traub 
method was faster for this problem too, since the numbers corresponding to m + n were 
respectively 12, 15, 29, and 27. However, there are more things involved in a computation 
than function evaluations, and, in fact, the measured computer time was almost half that 
using the Jenkins-Traub method. Furthermore, this ratio decreases as the degree of the 
polynomial increases, so that, for example, for the polynomial x*° + 1, the ratio was about 
1/4. Finally, the Jenkins-Traub program requires considerably more storage than this
method. 


8.13 THE EFFECT OF COEFFICIENT ERRORS ON 
THE ROOTS; ILL-CONDITIONED POLYNOMIALS 


The coefficients of the polynomial whose zeros we actually compute are 
seldom the true coefficients of the polynomial whose zeros we desire. The 
coefficients we use may arise from empirical data, in which case we shall not
know the true coefficients, or they may result from a lengthy computation
that introduced many rounding errors, in which case we may have very 
conservative bounds on the roundoff errors. And even when we know the 
true coefficients, we must round them when inserting them into the 
computer. How then do these coefficient errors affect the accuracy of 
the calculated zeros? 

Let the true polynomial be

$$F(z) = A_n z^n + A_{n-1} z^{n-1} + \cdots + A_1 z + A_0 \tag{8.13-1}$$

and define

$$\delta_i = A_i - a_i, \qquad i = 0, 1, \ldots, n \tag{8.13-2}$$

If the computed zero is $z_0$ and the true zero is

$$\bar{z}_0 = z_0 + \epsilon \tag{8.13-3}$$

where $\epsilon$ may be real or complex, we wish to find an estimate of the magnitude
of $\epsilon$. Assume for now that $\epsilon$ and the $\delta_i$’s are sufficiently small to permit
all products of the errors to be neglected. Then substituting (8.13-2) and
(8.13-3) into (8.13-1) and using the fact that $z_0$ is a root of (8.9-1), we get {66}

$$\sum_{i=0}^{n} \delta_i z_0^i + \epsilon f'(z_0) \approx 0 \tag{8.13-4}$$

Therefore,

$$\epsilon \approx -\frac{\sum_{i=0}^{n} \delta_i z_0^i}{f'(z_0)} \tag{8.13-5}$$

One obvious limitation of this estimate occurs when $f'(z_0)$ is zero or
small, in which case the previous assumption that products of errors could
be neglected was unfounded {66}. We might expect, however, that when
$f'(z_0)$ is not small and when the $\delta_i$’s are the result only of roundoff in
entering the coefficients into the computer, $|\epsilon|$ would indeed be small and
(8.13-5) would give a good estimate. The following example, due to Wilkinson
(1959), indicates that this need not be so for polynomials of high degree.
Consider the polynomial 


$$f(z) = (z + 1)(z + 2) \cdots (z + 20) \tag{8.13-6}$$

with zeros -1, -2, ..., -20. If $z_0 = -20$, then $|f'(z_0)| = 19!$. Suppose
that $\delta_i = 0$, i = 0, 1, 2, ..., 17, 18, 20, but that $\delta_{19} = 2^{-23} \approx 10^{-7}$. Then
(8.13-5) becomes

$$|\epsilon| \approx \frac{10^{-7} (20)^{19}}{19!} \approx 4.3 \tag{8.13-7}$$

which is not small. In fact, correct to nine decimal places, the zeros of
$f(z) + 2^{-23} z^{19}$ are

     -1.00000 0000        -10.09526 6145 ± 0.64350 0904i
     -2.00000 0000
     -3.00000 0000        -11.79363 3881 ± 1.65232 9728i
     -4.00000 0000
     -4.99999 9928        -13.99235 8137 ± 2.51883 0070i
     -6.00000 6944
     -6.99969 7234        -16.73073 7466 ± 2.81262 4894i
     -8.00726 7603
     -8.91725 0249        -19.50243 9400 ± 1.94033 0347i
    -20.84690 8101

For example, the zero corresponding to — 19 in f(z) has not only changed 
substantially but has become complex. It is therefore no surprise that 
(8.13-7) gives a poor result. The small changes in the zeros of small magni- 
tude suggest that (8.13-5) would give accurate estimates for these errors, and 
in fact this is correct {66}. 
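
The first-order estimate (8.13-5) is easy to evaluate exactly for this example, since $|f'(-j)| = (j - 1)! \, (20 - j)!$ for the polynomial (8.13-6). The sketch below is ours (the function name is hypothetical) and uses $\delta_{19} = 2^{-23}$ itself rather than the rounded value $10^{-7}$, so the figure for j = 20 comes out somewhat larger than in (8.13-7).

```python
from fractions import Fraction
from math import factorial


def wilkinson_estimate(j, delta19=Fraction(1, 2**23)):
    """First-order estimate (8.13-5) of |epsilon| for the zero z_0 = -j of (8.13-6)."""
    fprime = factorial(j - 1) * factorial(20 - j)  # |f'(-j)| = (j-1)! (20-j)!
    return float(delta19 * Fraction(j) ** 19 / fprime)


for j in (1, 5, 10, 15, 20):
    print(j, wilkinson_estimate(j))
```

For j = 5 the estimate is about 7.2 × 10^{-8}, in agreement with the tabulated zero -4.99999 9928, whereas for the larger zeros the estimate is so large that the linearization behind (8.13-5) cannot be trusted.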

A polynomial such as (8.13-6) in which a small change in a coefficient 
may cause a large change in one or more zeros is called ill-conditioned (cf. 
Sec. 1.7 and Sec. 9.5). If the coefficients are in fact known exactly so that the 
coefficient error is the result of roundoff in entering the coefficients into the 
computer, then by using multiple-precision arithmetic we can decrease 
this roundoff and increase the accuracy of the zeros. In fact, it is generally 
true that the solution of high-degree polynomial equations requires the use 
of multiple-precision floating-point arithmetic in order to achieve high 
accuracy. 


BIBLIOGRAPHIC NOTES 


Sections 8.1 to 8.7 The basis of much of these sections is the book by Traub (1964), which 
will not be referred to explicitly below. This book contains the best and most complete treat- 
ment available on functional iteration, including much that can be found nowhere else. It also 
contains an excellent bibliography. In some instances, our terminology differs somewhat from 
that of Traub. Another excellent book that we have used extensively is that by Ostrowski 
(1973). Other excellent general references for these sections and for the remainder of the chapter 
are Durand (1960-1961) and Householder (1970). See also Traub (1967). 


Section 8.3 The secant method and the method of false position are discussed in detail and 
with insight by Ostrowski (1973). Most standard texts in numerical analysis consider one or 
both of these methods. The modifications to regula falsi are studied by Anderson and Björck
(1973).


Section 8.4 Much of this section can be found in the papers by Traub (1961a,b). Ostrowski
(1973) discusses the Newton-Raphson method in detail.


Section 8.5 A widely used multipoint iteration method which is not discussed in this
section is that of Muller (1956) (see Prob. 21).


Section 8.7 Ostrowski (1973) has an interesting discussion of convergence. Wilkinson
(1959) discusses the problems that arise in Example 8.6. The $\delta^2$ process is due to Aitken (1926)
and is discussed in a number of numerical analysis texts. For other acceleration procedures, see
Ostrowski (1973).


Section 8.8 The most comprehensive work on the solution of systems of nonlinear alge- 
braic equations is that of Ortega and Rheinboldt (1970). The third edition of Ostrowski (1973) 
contains much material not available elsewhere in book form. A concise survey of the field is 
given in the booklet by Rheinboldt (1974). Two recent books which contain proceedings of 
conferences devoted to the numerical solution of nonlinear algebraic equations are those edited 
by Rabinowitz (1970) and by Byrne and Hall (1973). Both books contain computer programs for 
the solution of systems of nonlinear equations; see also Rall (1969). Much of the material in this 
section is based on a paper by Broyden in the volume edited by Rabinowitz, where further 
details and references can be found. The methods discussed in Probs. 39 to 41 are due, respec- 
tively, to Ostrowski (1973), Wolfe (1959), and Kincaid (1961). 


Section 8.9 Marden (1966) is the most complete reference on the location of the zeros of 
polynomials. Wall (1948) contains a number of results in this area; Wilf (1962) contains some 
selected results. The material on Sturm sequences is mainly from Gantmacher (1959). Wilf 
(1960) discusses the general problem of computing the zeros of polynomials, while Peters and 
Wilkinson (1971) discuss the practical aspects of the problem. The proceedings of a conference 
exclusively devoted to the problem of finding the zeros of a polynomial appear in the volume 
edited by Dejon and Henrici (1969). 


Section 8.10 The convergence of the Newton-Raphson and secant methods in the complex 
case is given by Householder (1970). Hildebrand (1974) has a good discussion of Bairstow’s 
method. A method very similar to Bairstow’s is discussed by McAuley (1962). Graeffe’s and 
Bernoulli's methods are discussed in many numerical analysis texts; see, for example, Hildebrand 
(1974) and Householder (1953). Many authors have discussed these two methods. Hildebrand 
(1974) contains a number of references to these papers. The mechanization of Graeffe’s method 
for digital computation is due to Bareiss (1960, 1967). Durand (1960-1961) contains a good 
exposition of Laguerre’s method. 


Section 8.11 The Jenkins-Traub method is the culmination of a series of research works by 
Traub and Jenkins and appears in Jenkins and Traub (1970a). The computer implementation of 
this method for complex polynomials is given in Jenkins and Traub (1972). The corresponding 
references for the real case are Jenkins and Traub (1970b) and Jenkins (1975), respectively. A 
related method is given by Young and Gregory (1972). 


Section 8.12 The original idea of this method can be found in Madsen (1973). This idea 
was further developed and incorporated in a computer program including error estimates by 
Madsen and Reid (1975). An algorithm in the same spirit has been proposed by Moore (1976). 


Section 8.13 Wilkinson (1964) contains an excellent discussion of the computational 
problems that arise in the solution of high-degree polynomial equations when the polynomials 
are ill-conditioned. McCracken and Dorn (1964) give a discussion, similar to ours, of the 
estimation of errors. 


PROBLEMS 


Section 8.2 


1 Let x = g(y) be the function inverse to y = f(x). 
(a) By using induction, show that we can write 

$$g^{(k)}(y) = \frac{X_k}{(y')^{2k-1}} \qquad k = 1, 2, \ldots$$

where $X_k$ is a polynomial in $y', y'', \ldots, y^{(k)}$ which satisfies the recurrence relation 

$$X_{n+1} = \frac{dX_n}{dx}\, y' - (2n - 1) X_n y'' \qquad X_1 = 1 \qquad n = 1, 2, \ldots$$

(b) Use this result to find explicit expressions for $g^{(k)}(y)$ for k = 1, 2, 3. [Ref.: Ostrowski 
(1973), pp. 20-22.] 

2 (a) Show that the convergence of a functional iteration method of order 1 implies that 
the asymptotic error constant is less than or equal to 1 but that for methods of order greater 
than 1 this need not be so. Why is the “or equal” part needed above? 

(b) May functional iteration methods have order less than 1 and still converge? Why? 


3 Solve the difference equations (8.2-11) and then use (8.2-12) to derive (8.2-13). 


4 (a) On a digital computer, let the time required, i.e., the cost, to compute a multiplica- 
tion or division be 1, a square root be 3, and any elementary function be 6. Ignoring the cost of 
additions and subtractions, what are the costs of evaluating f(x) and f'(x) when 
$f(x) = e^x \cos^2 x + \ln x \tan x$? Use this result to explain why it is generally quite inexpensive to 
compute f'(x) once f(x) has been computed. 

(b) If f(x) is a polynomial, what are the relative costs of computing f(x) and f'(x) if the 
algorithm (7.2-4) is used to evaluate a polynomial? 


Section 8.3 


5 (a) Prove that the sequence of iterates in the method of false position approaches a limit 
and that this limit is a solution of (8.1-1). 

(b) Show that if the conditions (i) $f''(x) \neq 0$ in $[x_1, x_2]$; (ii) $f(x_1) f''(x_1) > 0$ are satisfied, 
then $x_1$ always remains one of the points in the false-position iteration. These conditions are 
called Fourier’s conditions. 

(c) Therefore, deduce that a sufficient condition for the method of false position to be a 
one-point iteration method is for f(x) to be convex between f(x,) and f(x,). 

6 (a) Solve the difference equation (8.3-18) to get the solution (8.3-20). 

(b) Thus deduce the plausibility of (8.3-23). 

7 (a) Calculate the smallest positive root of $\cos x - x e^x = 0$ using (i) the secant method, 
(ii) the method of false position, (iii) the Illinois method, (iv) the Pegasus method with x, = 0, 
x, =. 

(b) Calculate the smallest positive root of tan x — cos x = 5 using the secant method. 

(c) Use the secant method to find the root of 


$$e^z = z \qquad z = x + iy$$


with smallest positive imaginary part by eliminating x between the two equations for the real 
and imaginary parts of $e^z = z$. 


Section 8.4 


8 (a) Use the result of Prob. 1 to derive (8.4-9). 

(b) Display the iteration formula (8.4-7) when m = 1 in terms of f and its derivatives. 

(c) What is the efficiency index of this method? When would it tend to be a better method 
to use than the Newton-Raphson method? 


9 Halley’s method. Consider the iteration function 


where Q is a polynomial. 

(a) If Q is linear, derive a method of this form of order 3. This is called Halley’s method. 

(b) Use Halley’s method and the method derived in Prob. 8b to compute $\sqrt{2}$ starting 
from x, = 2. 

10 (a) Derive the Newton-Raphson method by expanding $f(x_{i+1})$ in a Taylor series 
about $x_i$ and ignoring all but the first two terms. 

(b) Use the Taylor-series expansion of $f(\alpha)$ about $x_i$ to derive (8.4-16) directly. 


11 (a) Use (8.3-23) and (8.4-16) to show that if the two iteration functions in (8.2-9) are 
the secant method and the Newton-Raphson method, respectively, then C, = CY’Y°~ 12, 

(b) Use this result in (8.2-13) to show that the Newton-Raphson method has a lower 
computational efficiency than the secant method unless the cost of evaluating f’(x) is less than 
.44 times the cost of evaluating f(x). Is this the same result that would have been obtained using 
(8.2-15)? 

12 (a) Find the smallest positive root of the equation of Prob. 7a using (i) Newton’s 
method (ii) Halley’s method (see Prob. 9). Use x, =0; compare the results with those of 
Prob. 7a for speed of convergence. 

(b) Repeat the calculations of Prob. 7b using Newton’s and Halley’s method with x, = 0. 


13 (a) Use Newton’s and Halley’s method to compute (10)'/° starting with x, = 10. 

(b) Some computers do not have the operation of division built into them. Newton’s 
method can be used to find the reciprocal of a number without doing any divisions. Use this 
technique to calculate 7p starting with x, = .001. Can Halley’s method (Prob. 9) be used sim- 
ilarly to find a reciprocal? Why? 

(c) Try to use the Newton-Raphson method to compute ;4 starting with x, = 1.0. Explain 
the behavior. 

14 If F(x,) is an iteration function of order p, under what conditions on the function U(x;,) 
will 

G(x;) = F(x;) + U(x,)u’ r>p 


be an iteration function of order p? 


15 Consider the iteration formula 


$$x_{i+1} = x_i - c f(x_i)$$


where c may depend on f and $x_i$. Deduce that this method has linear convergence unless 
$c = 1/f'(x_i)$. 

16 Consider the iteration 

$$y_i = x_i - \frac{f(x_i)}{f'(x_i)} \qquad x_{i+1} = y_i - \frac{f(y_i)}{f'(x_i)}$$

This is the Newton-Raphson iteration with the derivative computed only every second step. 

(a) Show that if the iteration converges, 

$$\lim_{i \to \infty} \frac{x_{i+1} - \alpha}{(y_i - \alpha)(x_i - \alpha)} = \frac{f''(\alpha)}{f'(\alpha)}$$

(b) Thus deduce that 

$$\lim_{i \to \infty} \frac{x_{i+1} - \alpha}{(x_i - \alpha)^3} = \frac{1}{2} \left[ \frac{f''(\alpha)}{f'(\alpha)} \right]^2$$

(c) If the cost of computing f(x) is 1 and f'(x) is $\theta_1$, for what values of $\theta_1$ is this method 
more efficient than (i) the Newton-Raphson method; (ii) the secant method? 

(d) Use this method to repeat the calculation of Prob. 7a with x, = 0. 


Section 8.5 


17 (a) Derive (8.5-1). 

(b) Find the order of an iteration function derived from (8.5-2) with n = 3 and $r_j = 0$ for 
all j. 

(c) Repeat part (b) with n =2 andr, =r, = 1. 

18 (a) Derive the form of the three-point iteration function whose order was calculated in 
Prob. 17b. 

(b) Similarly, derive the two-point iteration function whose order was calculated in 
Prob. 17c. 

(c) Use the methods of parts (a) and (b) to repeat the calculation of Prob. 7a using 
suitable starting values. 


*19 Iteration functions by direct interpolation. Let y(x) be the interpolation polynomial of 
(8.5-1). If the equation $y(x_{i+1}) = 0$ can be solved for $x_{i+1}$, then this defines an iteration function. 
(a) Suppose all the points $x_{i-j}$, j = 0, ..., n, lie in an interval J which contains the root $\alpha$ 
and in which $f'(x) \neq 0$. If there is at least one $x_{i-j}$ on each side of $\alpha$, show that $y(x_{i+1}) = 0$ has a 
real solution in J. Need this solution be unique? (Even when all the points lie on one side of $\alpha$, 
there is such a real solution in general.) 
(b) Derive the following formula for the error 


_ fering, ll yt 
i+i-j 


oan (B + 1)!y'(n) jan 


where v lies in the interval determined by x,,, and a. [Ref.: Traub (1964), pp. 67-75.] 


20 (a) Derive the secant method using direct interpolation. 
(b) Derive the Newton-Raphson method using direct interpolation. Show that this derivation 
leads directly to (8.4-16) for the error. 


*21 Muller’s method. (a) Use a three-point Lagrangian interpolation formula and direct 
interpolation to derive the iteration formula 

$$x_{i+1} = x_i + (x_i - x_{i-1}) \, \frac{-2 f_i \delta_i}{g_i \pm \{ g_i^2 - 4 f_i \delta_i \lambda_i [ f_{i-2} \lambda_i - f_{i-1} \delta_i + f_i ] \}^{1/2}}$$

where $\lambda_i = (x_i - x_{i-1})/(x_{i-1} - x_{i-2})$, $\delta_i = 1 + \lambda_i$, $g_i = f_{i-2} \lambda_i^2 - f_{i-1} \delta_i^2 + f_i(\lambda_i + \delta_i)$. Why 
should the sign in the denominator be chosen to give the denominator the greatest magnitude? 
(b) Use part (b) of Prob. 19 to deduce that $\epsilon_{i+1} = -\epsilon_i \epsilon_{i-1} \epsilon_{i-2} [f'''(\xi)/(6 y'(\eta))]$. 
(c) Proceed as in Sec. 8.3 to deduce the order of Muller’s method. 
(d) Use this method to repeat the calculation of Prob. 7a. [Ref.: Muller (1956).] 


22 (a) Verify that substitution of (8.5-13) into the Newton-Raphson method gives the 
secant method. 

(b) Verify Eqs. (8.5-15) and (8.5-16). 

(c) Show that the order of (8.5-17) is given by the result of Prob. 17c. 



*23 (a) Derive a new iteration method from the Newton-Raphson method by replacing 
$f'(x_i)$ by its approximation found by differentiating a three-point Lagrangian interpolation 
formula based on $x_i$, $x_{i-1}$, and $x_{i-2}$. 

(b) Use reasoning similar to that in Sec. 8.5-2 to show that the order of this method is 
1.84. 

(c) Repeat the calculation of Prob. 7a using this method. What advantage does this 
method have over Muller’s method (Prob. 21)? Any disadvantages? 


Section 8.6 


24 (a) Verify (8.6-5) and (8.6-6). (b) Then derive (8.6-7) and (8.6-9). (c) Derive (8.6-11) and 
from this deduce (8.6-10). (d) Finally deduce that the order of (8.4-7) is 1 if $\alpha$ is not a simple root. 

25 (a) Approximately how fast will the error in the Newton-Raphson method decrease 
from one iterate to the next in the neighborhood of a root of multiplicity r > 1? 

(b) Test this conclusion on the equation (cos x — xe*)* = 0 and compare with the results 
of Prob. 7a. Use x, = 0. 

26 Apply the iterations of (a) Eq. (8.6-23) with $x_0 = 0$, $x_1 = 1$ and (b) Eq. (8.6-24) with 
x, = 0 to the equation in Prob. 25b and compare with the previous results. 


Section 8.7 


27 (a) Why can’t the method of false position be used to find a root of even multiplicity? 

(b) If $\alpha$ is a root of even multiplicity, show that the secant method may diverge no matter 
how close to $\alpha$ the initial approximations $x_1$ and $x_2$ are. 

28 Prove that the iteration of Example 8.5 would have converged if 0 < x, < 1. 


29 (a) Why is it reasonable to test the convergence of an iteration by considering 
$|x_{i+1} - x_i|$? 

(b) If the relative rather than the absolute error in the result is of interest, what would be a 
good quantity to use instead of $|x_{i+1} - x_i|$ to test for convergence? 

30 (a) Show that the iteration in Example 8.6 eventually converges. 

(b) Calculate the positive root of x?° — 1 = 0 using the Newton-Raphson method with 
x, =4 and the rule that if |x;,,/x,;| > 3, then, in place of the computed x; , ,, use x;,, = +3%,, 
where the sign is chosen to agree with x,, , /x;. 

31 Suppose the Newton-Raphson method is converging slowly, thereby indicating the 
presence of a multiple root. 

(a) Show how (8.7-4) can be used in conjunction with (8.6-10) to get an estimate of the 
multiplicity. [Since the multiplicity is an integer, this estimate amounts to a determination of the 
multiplicity. When the multiplicity has been found, (8.6-13) can then be used to get rapid 
convergence.] 

(b) Apply this technique to the equation of Prob. 25b. 


Section 8.8 


32 Let $J = (\partial f_i / \partial x^{(j)})$ and $K = (\partial g_i / \partial y^{(j)})$ be the Jacobian matrices of the vector function 
y = f(x) and its inverse function x = g(y), respectively. By differentiating the identity x = g(f(x)) 
with respect to each component $x^{(k)}$, k = 1, ..., n, show that I = KJ and hence that (8.8-9) and 
(8.8-10) are equivalent. 

*33 (a) Use the implicit-function theorem to derive (8.8-12). 

(b) Derive (8.8-12) by expanding $f_1[x^{(1)}, x^{(2)}]$ and $f_2[x^{(1)}, x^{(2)}]$ in two Taylor series about 
the root $[\alpha^{(1)}, \alpha^{(2)}]$, setting $f_1[\alpha^{(1)}, \alpha^{(2)}] = f_2[\alpha^{(1)}, \alpha^{(2)}] = 0$, and dropping all derivative terms of 
order higher than 1. 

(c) Derive the error term in (8.8-12) by using Eq. (8.8-7). 

(d) By analogy with (8.2-8), how would you define order for iteration methods for simultaneous 
equations? Use this definition to show that (8.8-12) has order 2 at a simple root. 

34 Find a root of the equation of Prob. 7c by solving the two equations for the real and 
imaginary parts using (8.8-12). Use x4? = .2, x? = 1.1. 

35 Consider the two simultaneous equations 


$$42.25x^2 + 27.885x - .749y^2 - 2.54y - 2.466 = 0$$
$$-.052x - .0192 + .00359y^2 + .00356y = 0$$

(a) Attempt to solve these two equations using (8.8-12) with x, = -.01, y, = .01. Carry 
out 15 iterations. 

(b) Now use x, = .3, y, = 2.5 in (8.8-12). 

36 Verify that $\nabla F(x) = 2J^T(x)f(x)$ by differentiating (8.8-17) with respect to each component 
$x^{(j)}$ of x, j = 1, ..., n. 

37 A real symmetric matrix S is said to be positive semidefinite if for all real vectors x, 
$(x, Sx) \geq 0$. It is well known that for any such matrix S there exists a matrix R with $R^T R = I$, the 
identity matrix, such that $R^T S R = D = \mathrm{diag}(d_1, \ldots, d_n)$, a diagonal matrix with diagonal 
elements $d_i \geq 0$. 

(a) Show that for any real matrix A, $A^T A$ is positive semidefinite. 

(b) Show that if $\lambda > 0$, then the inverse of $B(\lambda) = A^T A + \lambda I$ exists and is given by 
$R[D(\lambda)]^{-1}R^T$, where R is such that $R^T(A^T A)R = D$ and $D(\lambda) = \mathrm{diag}(d_1 + \lambda, \ldots, d_n + \lambda)$. 

(c) Show that $(H(\lambda)f, H(\lambda)f)$ is a decreasing function of $\lambda$, where $H(\lambda) = B(\lambda)^{-1}A^T$, and 
consequently that in the Levenberg-Marquardt algorithm, the step length $\|x_{i+1} - x_i\|_2$ 
decreases as $\lambda_i$ increases. 

38 Consider the system of nonlinear equations (8.8-2) and an initial approximation 
$x = x_1$. Define 

$$g(x, t) = f(x) - e^{-t} f(x_1) \qquad 0 \leq t < \infty$$

so that $g(x_1, 0) = 0$ and $g(x, t) \to f(x)$ as $t \to \infty$. 
(a) Show that if we define x(t) to be the solution of g(x(t), t) = 0, then x(t) satisfies the 
differential equation 

$$J(x)x'(t) = -f(x) \qquad x(0) = x_1 \qquad 0 \leq t < \infty$$


where J(x) is the Jacobian matrix of the vector function f(x). 

(b) Show that the application of the Newton-Raphson iteration (8.8-10) to (8.8-2) is 
equivalent to the solution of the above system of ordinary differential equations by Euler’s 
method (5.5-13) with h = 1. [Ref.: Boggs (1971).] 

*39 (a) Given three approximations $(x_i, y_i)$, $(x_{i-1}, y_{i-1})$, $(x_{i-2}, y_{i-2})$ to the solution of 
$f_1(x, y) = 0$, $f_2(x, y) = 0$, find the equations of two planes $z = L_1(x, y)$ and $z = L_2(x, y)$ such 
that $L_j(x_{i-k}, y_{i-k}) = f_j(x_{i-k}, y_{i-k})$, j = 1, 2; k = 0, 1, 2. 

(b) Calculate the next approximation $(x_{i+1}, y_{i+1})$ to be the intersection of $z = L_1(x, y)$, 
$z = L_2(x, y)$ and z = 0. When will this procedure fail? 

(c) Use this method to solve the equation of Prob. 7c by considering the two equations 
for the real and imaginary parts. Use as initial points (.4, 1.4), (.2, 1.4), and (.3, 1.1) and do three 
iterations. Will this method always converge? Is there a two-dimensional analog of the method 
of false position which will always converge? [Ref.: Ostrowski (1973), pp. 294-295.] 

*40 Let $x_i, x_{i-1}, \ldots, x_{i-n}$ be n + 1 approximations to the solution of (8.8-2) and let 
$\pi_0, \ldots, \pi_n$ be such that 


(a) Show that when n = 1, this iteration method is the secant method. 

(b) Let $\alpha$ be the solution of (8.8-2) and $G_k(x)$ and $2Q_k(x)$, respectively, the vector of first 
partial derivatives of $f_k(x)$ and the matrix of second partial derivatives of $f_k(x)$. Use the first two 
nonzero terms of the Taylor-series expansion of $f_k(x_{i+1})$ about $\alpha$ to get the approximation 

$$f_k(x_{i+1}) \approx \sum_j \pi_j G_k^T(\alpha)(x_{i-j} - \alpha) + (x_{i+1} - \alpha)^T Q_k(\alpha)(x_{i+1} - \alpha)$$

(c) Use the first two nonzero terms of the Taylor-series expansion of $f_k(x_{i-j})$ to eliminate 
$G_k(\alpha)$ in the above approximation and thereby derive 

$$f_k(x_{i+1}) \approx -\sum_j \pi_j (x_{i+1} - x_{i-j})^T Q_k(\alpha)(x_{i+1} - x_{i-j})$$

Can you infer from this that this method has quadratic convergence? Why? 
(d) Use this method to repeat the calculation of Prob. 7c. Use the starting values of 
Prob. 39. At every step replace that $x_{i-j}$ with $x_{i+1}$ for which 

$$\sum_k |f_k(x_{i-j})|^2$$

is maximal. Why is this latter a reasonable rule? [Ref.: Wolfe (1959).] 


*41 Consider the same two equations as in Prob. 39. Let f,(x, y) = —f,(x, y)—S,(x, y). 
(a) Let R,, S,, and T, be three points in the xy plane which are the initial approximations 
to the solution of the pair of equations. Let 


Ss =R, fi, S; 


where R, f, S, denotes the intersection with the xy plane of the line joining f,(R,) and f,(S,). 
Similarly, let 


T, =R, fi T, T, = Sf. Ti R, = 8, f,R, R,= T, fy Ri 5S, = T, f3 5S; 


Show that if this process leads to a solution of the system, then the points R,, S,, T, form a 
sequence of nearly similar triangles of decreasing size. 

(b) Use this method to repeat the computation of Prob. 7c. Use the starting values of 
Prob. 39. [Ref.: Kincaid (1961).] 


42 Verify that if $A^{-1}$ exists and that if $B = A - (Aq - v)q^T/q^T q$, then 

$$B^{-1} = A^{-1} - \frac{(A^{-1}v - q)\, q^T A^{-1}}{q^T A^{-1} v}$$

provided only that q and $q^T A^{-1} v \neq 0$. Hint: For any scalar s and matrices C and D, 
C(sD) = s(CD). 


Section 8.9 


43 Derive (8.9-6). 

44 (a) If a zero of f(x) in [a, b] has been found, how can the Sturm sequence (8.9-5) be 
used to find the multiplicity of the zero? 

(b) Prove that part of Theorem 8.3 relating to the case where f(a) or f(b) or both is zero. 
What happens when the zero is not simple? 


45 If $\{f_i(x)\}$, i = 1, ..., m, is a Sturm sequence, a generalized Sturm sequence is any 
sequence $\{p(x) f_i(x)\}$, where p(x) is an arbitrary polynomial. 

(a) Does Theorem 8.2 also hold for generalized Sturm sequences? 

(b) If f(x) and g(x) are two polynomials, use the Euclidean algorithm in a fashion analogous 
to (8.9-5) to generate a sequence of polynomials using $f_1(x) = f(x)$ and $f_2(x) = g(x)$. Show 
that this sequence is either a Sturm sequence or a generalized Sturm sequence. Thus deduce a 
method for finding the Cauchy index of any rational function. 


46 (a) Determine the number of positive and negative real zeros of 
$z^4 - 3z^3 - 54z^2 - 150z - 100$. 
(b) Determine the number of real zeros greater than 1 of $z^4 - 10z^3 + 34z^2 - 50z + 25$. 


Section 8.10 


47 (a) Verify that the following algorithm computes the value of the polynomial 
$P(z) = \sum_{i=0}^{n} a_{n-i} z^i$ and its first p reduced derivatives at the point z = x, p < n, using 
(p + 1)(2n - p)/2 additions, 2n - 1 multiplications, and p divisions. 


(b) How many storage locations are required for this algorithm and how many for p + 1 
applications of synthetic division? [Ref.: Shaw and Traub (1974).] 


48 (a) Derive a synthetic-division algorithm analogous to those in Sec. 8.10 for the division 
of a polynomial of degree n by a polynomial of degree j < n. 

(b) Use this algorithm to determine whether $z^5 - z^4 - 2z^3 + 2z^2 + z - 1$ has a triple zero 
at (i) z = 1; (ii) z = -1. Find all the zeros of this polynomial. 

49 The polynomial $P(x) = x^3 + 9813.18x^2 + 8571.08x + .781736$ has the zeros 
-9812.31, -.873412, and -.0000912157. 

(a) Assume that -9812.30 has been accepted as a zero of P(x). Deflate P(x) using this 
value in (8.10-1) and then use the quadratic formula to determine the remaining zeros of the 
deflated polynomial. 

(b) Do the same with the values -.873413 and -.0000912156. 


50 Backward deflation. (a) Show that if f(z) given by (8.9-1) is written in the form 
$a_0 + a_1 z + \cdots + a_n z^n$, it can be deflated by a zero $\alpha$ by dividing through by $-\alpha + z$ to yield the 
deflated polynomial $q_0 + q_1 z + \cdots + q_{n-1} z^{n-1}$, where the $q_i$ are given by the recursion 

$$a_0 = -\alpha q_0 \qquad a_r = -\alpha q_r + q_{r-1} \qquad r = 1, \ldots, n-1$$

(b) Apply backward deflation to the polynomial of the previous problem using each of the 
three accepted zeros given there and determine in each case the remaining zeros of the deflated 
polynomial. [Ref.: Peters and Wilkinson (1971).] 


51 (a) Show that if the zeros of f(z) given by (8.9-1) with $a_0 \neq 0$ are $\alpha_1, \ldots, \alpha_n$, then the 
zeros of the polynomial $g(z) = a_0 z^n + a_1 z^{n-1} + \cdots + a_{n-1} z + a_n$ are $\alpha_1^{-1}, \ldots, \alpha_n^{-1}$. 

(b) Let the polynomial resulting from the deflation of g(z) by a zero $\alpha^{-1}$ be 
$d_0 z^{n-1} + \cdots + d_{n-2} z + d_{n-1}$, where the $d_k$ are given as in (8.10-1) by the recursion 

$$d_0 = a_0 \qquad d_k = a_k + \alpha^{-1} d_{k-1} \qquad k = 1, \ldots, n-1$$

Show that these are the same coefficients that would result if we performed backward deflation 
by dividing through by $1 - x/\alpha$ instead of by $-\alpha + x$. Hence, conclude that the reason backward 
deflation is stable if we deflate with the zero of largest magnitude is that we are essentially 
performing deflation of g(z) with the zero of smallest magnitude. 


52 Derive the equations analogous to (8.10-11) to (8.10-14) for the secant method. 


53 Do the calculation of Example 8.11 with an initial approximation p, = —1,q, = —1. 
Would you expect the convergence of the Bairstow iteration to be very sensitive to the initial 
approximation? Why? 




54 (a) Use (8.10-11) to (8.10-14) to find the zero of maximum magnitude of 
$z^4 + 2z^3 + 3z^2 + 4z + 5$. Use .3 + 1.4i as the initial approximation. 
(b) Repeat this calculation using the equations derived in Prob. 52. 


55 (a) Verify (8.10-29). 
(b) Use (8.10-30), (8.10-31), and (8.10-33) to show that 


ay” = (—1¥ a Y (Ox,Pu,“** Pay)” exp [i2°(G,, +++ + OI 
1 


(c) Use this to deduce (8.10-40) when (8.10-34) holds. 


56 Let the polynomial f(z) of degree n have zeros $z_i$, i = 1, ..., n. 
(a) Show that if z is sufficiently large, then 

$$(z - z_i)^{-1} = z^{-1} + z_i z^{-2} + z_i^2 z^{-3} + \cdots$$

and, thus, 

$$\sum_{i=1}^{n} (z - z_i)^{-1} = n z^{-1} + s_1 z^{-2} + s_2 z^{-3} + \cdots \qquad \text{where } s_k = \sum_{i=1}^{n} z_i^k$$

(b) Show that 

$$f(z) \sum_{i=1}^{n} (z - z_i)^{-1} = f'(z)$$

(c) Thus deduce Newton's identities 

$$a_n s_m + a_{n-1} s_{m-1} + \cdots + a_{n-m+1} s_1 + m a_{n-m} = 0 \qquad m = 1, \ldots, n$$
$$a_n s_{n+j} + a_{n-1} s_{n+j-1} + \cdots + a_0 s_j = 0 \qquad j = 1, 2, \ldots$$

(d) From this deduce that, if (8.10-45) is used to generate the starting values in Bernoulli’s 
method, then all the $c_i$’s are unity. 

57 (a) How must (8.10-42) be modified when some of the roots of (8.9-1) are multiple? 

(b) Nevertheless, show that if $\alpha_1$ is multiple but real, then (8.10-44) still holds. 

58 (a) In Bernoulli’s method, define A, = u,_,/u,so that lim,.,, (1/A,) =4a,. Show, by 
rewriting (8.10-41) that 


1 
“T= by + Dy aAgig + Oya Ag- agi Ho + Do(Ageng a 7 Age 1) 
k 


(b) Show that for k <n, (8.10-51) can be expressed in terms of the /,’s. 
(c) Show that for k > n, A, can be computed using the recurrence relation 
Yeo = Bo 
Ver = Veor-1Ak-n+r +5, r=1,....n-1 
| 


A= - 


Vk, n-t 


59 Use Bernoulli's method to find the zero of maximum magnitude of the polynomial of 
Prob. 54. Use the result of Prob. 54 to explain the slow convergence (you will need to calculate 
at least to uj in order to stabilize the first decimal of 8 ,). Then calculate the remaining zeros of 
the polynomial. 

*60 (a) Verify (8.10-51) to (8.10-54). 
(b) Show that, if all the zeros of f(z) are real, then H is always nonnegative. 
(c) Verify (8.10-60). 
(d) Show that if the sign is chosen in (8.10-59) to agree with f'(x;), then if x, <a, 
Xi< Xj4. <a, and ifa, <x,,0, < X41 <X,. 

(e) Use Laguerre’s method with x, = 10° to find a zero of (i) $z^4 + 8.1z^3 - 19.8z^2 - 5.9z + 21$; 
(ii) $z^4 + 2z^3 + 3z^2 + 4z + 5$. 


Section 8.11 


61 (a) Verify (8.11-33) to (8.11-36). 

(b) Use (8.11-33) and (8.11-34) to verify (8.11-21). 

62 Let V(z) = H(z)/P(z), where P(z) is given by (8.11-1) and H(z) by (8.11-8) and 
(8.11-9). 

(a) Show that V?*)(z) = [V(z) — V(s,)]/(z — s,), where s, is the variable shift given 
by t, of (8.11-6). 

(b) Show that V?*(5,) = [V%(s,)]' and that AG*) = —V(s,), where A is the 
coefficient of z”~' in H(z). 

(c) Show that t,,, =s, + V%s,)/[V%(s,)] = s, — W(s,)/[Ws,)]!, where W(z) = 
1/V(z) = P(z)/H™(z), and consequently, that formula (8.11-6) is precisely one Newton itera- 
tion step performed on the rational function P(z)/H)(z). Note that for A sufficiently large, this 
function is close to a linear polynomial whose zero is «,. 


Section 8.12 


63 Consider the function f(z) = g(x, y) + ih (x, y), where z = x + iy. 
(a) Show that f"(z) = g,(x, y) + h,(x, y) + i(h,(x, y) — g,(x, y)). 
(b) Define f(z)/f'(z) = G(x, y) + iH(x, y) and $|f(z)| = \sqrt{g^2(x, y) + h^2(x, y)} = F(x, y)$. 


Use the Cauchy-Riemann equations 


$$g_x(x, y) = h_y(x, y) \qquad g_y(x, y) = -h_x(x, y)$$


to show that ~ H(x, y)/G(x, y) = —F,(x, y)/F,,(x, y), which is the direction of steepest descent of 
| f(z)| at the point z = (x, y). 

64 Use any method or combination of methods to find the roots of (a) z> — 2z — 5 = 0; 
(b) z2>~—1627+3=0; (c) z*—3z74+2z—1=0; (d) 24+4+427—3z-—1=0; (e) 
z® + 523+ 727+1=0. 

65 Use any method to find the root of largest magnitude of (a) z> — 20z? — 3z + 18 =0; 
(b) z* — 3z* — 60z? + 150z + 300 = 0; (c) 10z*° — 212? — 40z + 84 =0. 


Section 8.13 


66 (a) Derive (8.13-4) under the assumption that products of errors can be neglected. 

(b) Why is the assumption that products of errors can be neglected generally unfounded 
when $f'(z_0)$ is small? 

(c) Apply (8.13-5) to f(z) of (8.13-6) with $\delta_i = 0$, $i \neq 19$ and $\delta_{19} = 2^{-23}$ for $z_0 = -1, -5, 
-8$, and -15. 


CHAPTER 


NINE 


THE SOLUTION OF 
SIMULTANEOUS LINEAR 
EQUATIONS 


9.1 THE BASIC THEOREM AND THE PROBLEM 


Our concern in this chapter is with the solution of n simultaneous linear 
equations in n unknowns 


$$\sum_{j=1}^{n} a_{ij} x_j = b_i \qquad i = 1, \ldots, n$$ (9.1-1) 


Equation (9.1-1) is conveniently written in the matrix form 


Ax =b (9.1-2) 
where $A = [a_{ij}]$ is the n × n matrix of coefficients, $x^T = (x_1, \ldots, x_n)$ and 
$b^T = (b_1, \ldots, b_n)$ with T denoting the transpose.† We shall use matrix 


algebra and matrix notation extensively but not exclusively in this chapter. 
We assume everywhere in this chapter that A and b are real. 

We denote by $A_b$ the n × (n + 1) matrix which has the column vector b 
appended as an (n + 1)st column to A. We denote the rank of any matrix A 
by r(A). The basic theorem on the existence of solutions of (9.1-2) is as 
follows. 


† In this and the next chapter column vectors will be in boldface and row vectors will be in 
boldface with a superscript T. 


Theorem 9.1 


1. The system of equations (9.1-2) has a solution if and only if 
$r(A) = r(A_b)$. 

2. If $r(A) = r(A_b) = k < n$, then if $x_{i_1}, x_{i_2}, \ldots, x_{i_k}$ are k variables whose 
corresponding columns are linearly independent in A, the remaining 
n - k variables can be arbitrarily assigned; i.e., since there must be 
some set of k linearly independent columns of A (why?), there is an 
(n - k)-parameter family of solutions. 

3. If $r(A) = r(A_b) = n$, there is a unique solution. 


Corollary 9.1 From 2 and 3 it follows that in the homogeneous case 
(b = 0), there is a nontrivial solution if and only if r(A) <n. 


This familiar theorem and its corollary can be proved using Gaussian 
elimination, which will be discussed in Sec. 9.3-1; the proofs themselves we 
leave to a problem {1}. 
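
The three cases of the theorem can be checked mechanically on small systems. The following is a minimal sketch, not part of the original text: it assumes exact rational arithmetic (Python's `fractions` module) so that the rank is not blurred by roundoff, and the names `rank` and `classify` are hypothetical helpers introduced only for illustration.

```python
from fractions import Fraction


def rank(rows):
    """Rank of a matrix by forward elimination in exact rational arithmetic."""
    m = [[Fraction(v) for v in row] for row in rows]
    r = 0  # number of pivots found so far
    for c in range(len(m[0])):
        # Look for a nonzero pivot in column c at or below row r.
        p = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if p is None:
            continue
        m[r], m[p] = m[p], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][c] / m[r][c]
            m[i] = [u - factor * v for u, v in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r


def classify(a, b):
    """Apply the three cases of Theorem 9.1 to the n x n system Ax = b."""
    n = len(a)
    augmented = [list(row) + [bi] for row, bi in zip(a, b)]  # A with b appended
    r_a, r_ab = rank(a), rank(augmented)
    if r_a != r_ab:
        return "no solution"
    if r_a == n:
        return "unique solution"
    return f"{n - r_a}-parameter family of solutions"


print(classify([[1, 2], [3, 4]], [5, 6]))
print(classify([[1, 2], [2, 4]], [3, 6]))
print(classify([[1, 2], [2, 4]], [3, 7]))
```

In floating-point arithmetic the test for a zero pivot would need a tolerance, which is one reason rank decisions are delicate in practice.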

In contrast to the equations of Chaps. 5 and 8, there is no problem in 
finding an analytic solution of (9.1-1). Cramer’s rule gives us such a solution. 
Instead the problem is in computing the solution. Even for quite low-order 
systems, the large amount of computation required to evaluate determinants 
makes the use of Cramer’s rule impractical. Therefore, our major aim in this 
chapter will be to develop more efficient computational algorithms for the 
solution of (9.1-1). 

The efficiency of an algorithm will be judged by two main criteria: (1) 
How fast is it; i.e., how many operations are involved? (2) How accurate is 
the computed solution? 

These two criteria are aimed at the evaluation of algorithms for the 
solution of high-order systems (up to 500 equations or more) on a digital 
computer. Because of the formidable amount of computation required to 
solve (9.1-1) for large systems, the need to answer the first question is clear. 
The need to answer the second question arises because small roundoff errors 
may cause errors in the computed solution out of all proportion to their size. 
Furthermore, because of the large number of operations involved in solving 
a high-order system, the potential accumulated roundoff error is nontrivial. 
In Sec. 6.3-1 we had a glimpse of how such roundoff errors could cause 
substantial loss of accuracy. 

Before getting into the details of solving (9.1-1), in the next section we 
shall consider the problem in rather general terms in order to get an intuitive 
feeling for the difficulties that will be encountered. 


9.2 GENERAL REMARKS 


Sources and types of problems The matrices of coefficients that occur in 
practice generally fall into one of two categories: 


1. Filled but not large. By filled we mean that there are few zero elements 
and by not large we mean matrices of order, say, less than 100. Such 
matrices occur in a wide variety of problems in statistics, mathematical 
physics, engineering, etc. 

2. Sparse and perhaps very large. In contrast to the above a sparse matrix 
has few nonzero elements. In most cases, these elements lie on or near the 
main diagonal. Very large may mean of order 1000 or more. Such 
matrices arise commonly in the numerical solution of partial differential 
equations {2}. 


It should not be surprising that different approaches are commonly used for 
matrices in these two categories. One cause of this is that the size of the 
matrices in the second category often makes memory space a problem even 
on the largest computers. But, basically, it is the different characters—sparse 
and filled—of the matrices that make the direct methods to be described in 
Secs. 9.3 to 9.5 generally superior for the first category, while the iterative 
methods of Secs. 9.6 and 9.7 are most often used for problems in the second 
category. It should be emphasized that there are no hard-and-fast rules in 
this, however, and indeed there is some controversy. Almost no one would 
recommend iterative methods for filled, low-order matrices, but there is 
substantial opinion in favor of direct methods for medium-sized sparse 
matrices. 


Ill condition Assuming that A is nonsingular, as we shall throughout this 
chapter, the solution of (9.1-2) can be written 


$$x = A^{-1} b \tag{9.2-1}$$

Suppose that the elements of A have been normalized so that the largest in
magnitude has order of magnitude unity. Suppose also that $B = A^{-1}$ has
some very large elements, one of which is

$$b_{ji} = \frac{A_{ij}}{|A|} \tag{9.2-2}$$

where $|A|$ denotes the determinant of A and $A_{ij}$ is the cofactor of $a_{ij}$ and is
therefore unaffected by a change in $a_{ij}$. The assumption that $b_{ji}$ is large
means that $A_{ij}$ must be large relative to $|A|$. Since one of the terms in the
expansion of $|A|$ about the ith row or jth column is $a_{ij} A_{ij}$, a small error in
$a_{ij}$ (relative to unity, i.e., relative to the normalization of A) may cause a
large relative error in $|A|$ and therefore a large relative error in $b_{ji}$. This in
turn can cause a large relative error in x. Similarly, a small change in an
element of b could cause a large change in x. This effect can also be produced
by roundoff errors in the course of the computation (cf. Sec. 6.3-1) because a
roundoff error introduced during the computation is equivalent in effect to
an initial error in the elements of A (why? See Sec. 9.4).

An alternative way of looking at this problem is to consider the residual 
vector 


$$r = b - A x_c \tag{9.2-3}$$

where $x_c$ is the computed solution. If $A^{-1}$ has some large elements, then r
may be very small even if $x_c$ is substantially different from the true solution.
For let $x_t$ be the true solution of (9.1-2), so that $A x_t = b$. Then (9.2-3) can be
written

$$r = A(x_t - x_c) \tag{9.2-4}$$

or

$$x_t - x_c = A^{-1} r \tag{9.2-5}$$

Therefore, if some elements of $A^{-1}$ are large, a small component of r can still
mean a large difference between $x_t$ and $x_c$, or conversely, $x_c$ may be far from
$x_t$ but r can nevertheless still be small. This implies that we cannot test the
correctness of a computed solution of (9.1-2) merely by substituting the
result into the equations and calculating the residuals. Or to put it another
way, an accurate solution, i.e., a small difference between $x_t$ and $x_c$ (see
Sec. 9.4), will always produce small residuals if the matrix A is normalized,
but small residuals do not guarantee an accurate solution.

If the matrix A is normalized as described above and is such that $A^{-1}$
contains some very large elements, then we say the matrix and therefore the
system of equations is ill-conditioned [see Sec. 1.7]. (Conversely, if the largest
element in magnitude of $A^{-1}$ has order of magnitude unity, the matrix may
be said to be well-conditioned.) The following simple example will illustrate
the dangers inherent in solving ill-conditioned systems. Consider the
system [see Prob. 27, Chap. 1]

$$2x + 6y = 8 \qquad 2x + 6.00001y = 8.00001 \tag{9.2-6}$$

which has the solution x = 1, y = 1, and the system

$$2x + 6y = 8 \qquad 2x + 5.99999y = 8.00002 \tag{9.2-7}$$

which has the solution x = 10, y = -2. Here a change of .00002 in $a_{22}$ and
.00001 in $b_2$ has caused a gross change in the solution. The inverse of the
matrix of coefficients in (9.2-6) has elements whose order of magnitude is
$10^5$, which indicates the ill condition of A. The necessity in the preceding
discussion of the requirement that A be normalized can be seen by considering
the systems (9.2-6) and (9.2-7) both multiplied by $10^5$. Now a small
relative change in $a_{22}$ and $b_2$ causes the same gross change in the solution as
above although the elements of $A^{-1}$ are all of order of magnitude unity.

The coefficients in (9.2-7) might, for example, be empirical values of 
those in (9.2-6). If empirical values of the coefficients accurate to more than 
five decimal places cannot be obtained, then the solution of (9.2-7), no matter 
how accurately it is calculated, may be grossly in error. How to calculate 
solutions of (9.1-2), as accurate as the data warrant, when the system is 
ill-conditioned is probably the single most difficult problem encountered in 
the solution of simultaneous linear equations. In Sec. 9.5 we shall consider 
the solution of ill-conditioned systems in some detail. 
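
The behavior of systems (9.2-6) and (9.2-7) can be checked with a few lines of exact rational arithmetic. The following is only a sketch: the helper `solve_2x2` (Cramer's rule for two equations) and the variable names are illustrative assumptions, not part of the text. It also evaluates the residual (9.2-3) obtained when the solution of (9.2-7) is substituted into (9.2-6).

```python
from fractions import Fraction as F

def solve_2x2(a11, a12, a21, a22, b1, b2):
    """Solve a 2x2 system exactly by Cramer's rule (hypothetical helper)."""
    det = a11 * a22 - a12 * a21
    x = (b1 * a22 - a12 * b2) / det
    y = (a11 * b2 - a21 * b1) / det
    return x, y

# System (9.2-6) and the perturbed system (9.2-7), as exact rationals.
sol_6 = solve_2x2(F(2), F(6), F(2), F("6.00001"), F(8), F("8.00001"))
sol_7 = solve_2x2(F(2), F(6), F(2), F("5.99999"), F(8), F("8.00002"))

# Residual r = b - A x of the vector (10, -2) substituted into (9.2-6).
x_c = (F(10), F(-2))
r = (F(8) - (2 * x_c[0] + 6 * x_c[1]),
     F("8.00001") - (2 * x_c[0] + F("6.00001") * x_c[1]))

print(sol_6, sol_7, [float(v) for v in r])
```

The vector (10, -2) is far from the solution (1, 1) of (9.2-6), yet its residual components are 0 and .00003, which illustrates why small residuals alone do not certify an accurate solution.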


Sources of error There are three sources of error in the solution of systems of 
linear equations, two of which were mentioned above. The first is caused by 
errors in the coefficients and the elements of b. When such errors occur 
because these quantities are empirical, we must live with them. If a bound on 
the empirical errors is known, we can do no more than use this to get bounds 
on the errors in the solution (see Sec. 9.4). When the coefficients and the 
vector b are known exactly (as they are, for example, when a partial differen- 
tial equation is approximated by differences) but must be rounded when they 
are inserted into the computer, we can control this source of error by using 
double-precision arithmetic, if necessary. 

The second source of error is the roundoff error introduced in calculat- 
ing the solution. The third source is truncation error. In direct methods, e.g., 
Cramer’s rule or Gaussian elimination, which would lead to an exact solu- 
tion in the absence of roundoff, there is no truncation error. But the iterative 
methods to be discussed in Secs. 9.6 and 9.7 generally converge only as the 
number of iterations goes to infinity. They are therefore subject to trunca- 
tion error. One of the determining factors in deciding to use an iterative 
instead of a direct method is whether the truncation error can be made 
extremely small with an amount of computation comparable with, or less 
than, that required for a direct method. Truncation error is therefore almost 
always a minor source of error in the computed solution of (9.1-1). 


9.3 DIRECT METHODS 


As we indicated above, a direct method for the solution of (9.1-1) is one 
which if all computations were carried out without roundoff would lead to 
the true solution of the given system. Most direct methods involve some 
variation of the elimination procedure associated with the name of Gauss, 
which we shall now consider in detail.

9.3-1 Gaussian Elimination

We write out the system (9.1-1) in the form

$$\begin{aligned}
a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n &= a_{1,n+1} \\
a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n &= a_{2,n+1} \\
&\;\;\vdots \\
a_{n1}x_1 + a_{n2}x_2 + \cdots + a_{nn}x_n &= a_{n,n+1}
\end{aligned} \tag{9.3-1}$$

where for notational simplicity we have written $b_i = a_{i,n+1}$. We assume, of
course, that the matrix of coefficients is nonsingular. Suppose $a_{11} \neq 0$. We
subtract the multiple $a_{i1}/a_{11}$ of the first equation from the ith equation,
i = 2, ..., n, to get the first derived system

$$\begin{aligned}
a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n &= a_{1,n+1} \\
a_{22}^{(1)}x_2 + \cdots + a_{2n}^{(1)}x_n &= a_{2,n+1}^{(1)} \\
&\;\;\vdots \\
a_{n2}^{(1)}x_2 + \cdots + a_{nn}^{(1)}x_n &= a_{n,n+1}^{(1)}
\end{aligned} \tag{9.3-2}$$

The new coefficients $a_{ij}^{(1)}$ are given by

$$a_{ij}^{(1)} = a_{ij} - m_{i1} a_{1j} \qquad i = 2, \ldots, n; \quad j = 2, \ldots, n+1 \tag{9.3-3}$$

where $m_{i1} = a_{i1}/a_{11}$, i = 2, ..., n. If $a_{11} = 0$, then, because A is nonsingular,
by interchanging two rows of (9.3-1) we can get a nonzero element in the
upper left-hand corner (why?). We can also interchange two columns of A to
achieve the same effect, but in this case the order in which the elements of the
solution are computed will not be the natural order, $x_n, \ldots, x_1$, but some
permutation of it.

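Now, if $a_{22}^{(1)}$ in (9.3-2) is nonzero, we subtract $m_{i2} = a_{i2}^{(1)}/a_{22}^{(1)}$ times the
second equation from the ith equation in (9.3-2), i = 3, ..., n, and get the
second derived system

$$\begin{aligned}
a_{11}x_1 + a_{12}x_2 + a_{13}x_3 + \cdots + a_{1n}x_n &= a_{1,n+1} \\
a_{22}^{(1)}x_2 + a_{23}^{(1)}x_3 + \cdots + a_{2n}^{(1)}x_n &= a_{2,n+1}^{(1)} \\
a_{33}^{(2)}x_3 + \cdots + a_{3n}^{(2)}x_n &= a_{3,n+1}^{(2)} \\
&\;\;\vdots \\
a_{n3}^{(2)}x_3 + \cdots + a_{nn}^{(2)}x_n &= a_{n,n+1}^{(2)}
\end{aligned} \tag{9.3-4}$$

where

$$a_{ij}^{(2)} = a_{ij}^{(1)} - m_{i2} a_{2j}^{(1)} \qquad i = 3, \ldots, n; \quad j = 3, \ldots, n+1 \tag{9.3-5}$$

Again, if $a_{22}^{(1)} = 0$, we can interchange two rows or columns to get a nonzero
element in the (2, 2) position. Continuing with this process through n - 1
steps, we arrive at the final system

$$\begin{aligned}
a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n &= a_{1,n+1} \\
a_{22}^{(1)}x_2 + \cdots + a_{2n}^{(1)}x_n &= a_{2,n+1}^{(1)} \\
a_{33}^{(2)}x_3 + \cdots + a_{3n}^{(2)}x_n &= a_{3,n+1}^{(2)} \\
&\;\;\vdots \\
a_{nn}^{(n-1)}x_n &= a_{n,n+1}^{(n-1)}
\end{aligned} \tag{9.3-6}$$

with the diagonal elements all nonzero and where

$$a_{ij}^{(k)} = a_{ij}^{(k-1)} - m_{ik} a_{kj}^{(k-1)} \qquad k = 1, \ldots, n-1; \quad j = k+1, \ldots, n+1; \quad i = k+1, \ldots, n \tag{9.3-7}$$

$$a_{ij}^{(0)} = a_{ij}$$

with $m_{ik} = a_{ik}^{(k-1)}/a_{kk}^{(k-1)}$. Given (9.3-6), the solution is easily calculated as†

$$x_i = \frac{1}{a_{ii}^{(i-1)}} \left[ a_{i,n+1}^{(i-1)} - \sum_{j=i+1}^{n} a_{ij}^{(i-1)} x_j \right] \qquad i = n, \ldots, 1 \tag{9.3-8}$$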


The process leading to (9.3-6) is called Gaussian elimination; the calculation 
of the solution by (9.3-8) is called the back substitution. Using Gaussian 
elimination, it is easy to prove Theorem 9.1 {1}. 
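
The elimination (9.3-7) and back substitution (9.3-8) translate almost line for line into code. The sketch below is illustrative only: the function name `gaussian_elimination` and the test system are assumptions, exact rational arithmetic is used so that roundoff plays no role, and rows are interchanged only when a diagonal element is exactly zero, as in the discussion above. A singular matrix is not handled.

```python
from fractions import Fraction

def gaussian_elimination(aug):
    """Solve n equations given as an n x (n+1) augmented matrix.

    Follows (9.3-7) and (9.3-8); rows are interchanged only when a
    diagonal element is exactly zero. The matrix is assumed nonsingular.
    """
    a = [row[:] for row in aug]          # work on a copy
    n = len(a)
    for k in range(n - 1):
        if a[k][k] == 0:                 # find a lower row with a nonzero entry
            r = next(i for i in range(k + 1, n) if a[i][k] != 0)
            a[k], a[r] = a[r], a[k]
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]        # multiplier m_ik
            for j in range(k + 1, n + 1):
                a[i][j] -= m * a[k][j]
            a[i][k] = 0
    x = [0] * n
    for i in range(n - 1, -1, -1):       # back substitution, i = n, ..., 1
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (a[i][n] - s) / a[i][i]
    return x

# Hypothetical 3 x 3 system whose (1, 1) element is zero.
aug = [[Fraction(e) for e in row] for row in
       [[0, 2, 1, 7], [1, 1, 1, 6], [2, 1, -1, 1]]]
x = gaussian_elimination(aug)
print(x)
```

Because the (1, 1) element of this system is zero, the first stage begins with a row interchange before any multipliers are formed.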

A variant of the above process, which it is convenient to consider here, is 
the Gauss-Jordan reduction. In this technique we proceed as before to get 
(9.3-2), but in place of (9.3-4) we derive the system 


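$$\begin{aligned}
a_{11}x_1 + a_{13}^{(2)}x_3 + \cdots + a_{1n}^{(2)}x_n &= a_{1,n+1}^{(2)} \\
a_{22}^{(1)}x_2 + a_{23}^{(1)}x_3 + \cdots + a_{2n}^{(1)}x_n &= a_{2,n+1}^{(2)} \\
a_{33}^{(2)}x_3 + \cdots + a_{3n}^{(2)}x_n &= a_{3,n+1}^{(2)} \\
&\;\;\vdots \\
a_{n3}^{(2)}x_3 + \cdots + a_{nn}^{(2)}x_n &= a_{n,n+1}^{(2)}
\end{aligned} \tag{9.3-9}$$

in which the element in the first row and second column has also been
reduced to zero, the remaining elements in the first row are given by (9.3-5)
with i = 1, and $a_{2,n+1}^{(2)} = a_{2,n+1}^{(1)}$. Continuing in this way, so that at each
stage all the elements in a column except the diagonal element are reduced
to zero, we get finally

$$\begin{aligned}
a_{11}x_1 &= a_{1,n+1}^{(n-1)} \\
a_{22}^{(1)}x_2 &= a_{2,n+1}^{(n-1)} \\
&\;\;\vdots \\
a_{nn}^{(n-1)}x_n &= a_{n,n+1}^{(n-1)}
\end{aligned} \tag{9.3-10}$$

with the $a_{ii}^{(i-1)}$ given by (9.3-7) and with

$$a_{i,n+1}^{(k)} = \begin{cases} a_{i,n+1}^{(k-1)} - m_{ik}\, a_{k,n+1}^{(k-1)} & i = 1, \ldots, k-1, k+1, \ldots, n \\ a_{i,n+1}^{(k-1)} & i = k \end{cases} \tag{9.3-11}$$

The solution of (9.3-10) is then simply given by

$$x_i = \frac{a_{i,n+1}^{(n-1)}}{a_{ii}^{(i-1)}} \qquad i = 1, \ldots, n \tag{9.3-12}$$

† Here we use the convention that $\sum_{j=n+1}^{n} = 0$.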


At first glance it might seem that the Gauss-Jordan reduction is to be 
preferred to Gaussian elimination, but we shall show now that in fact Gauss- 
ian elimination is the more efficient of the two. 

As usual in determining the number of operations required, we shall
consider only multiplications and divisions. To estimate the number of
operations in Gaussian elimination, we use (9.3-7). For each k we need n - k
divisions† ($a_{ik}^{(k-1)}/a_{kk}^{(k-1)}$) and (n - k)(n - k + 1) multiplications. The total
number of multiplications and divisions is then {4}

$$M = \sum_{k=1}^{n-1} [(n-k)(n-k+1) + (n-k)] = \tfrac{1}{3}n^3 + O(n^2)\,\ddagger \tag{9.3-13}$$

We single out the $n^3$ term since it is only for large n that we are interested in
M. The back substitution adds to M a term n(n + 1)/2 and does not
affect the $n^3$ term.

For the Gauss-Jordan reduction, we again use (9.3-7), but this time with
i running from 1 to k - 1 as well as from k + 1 to n. The calculation
corresponding to (9.3-13) results in {4}

$$M = \tfrac{1}{2}n^3 + O(n^2) \tag{9.3-14}$$

† Or one division ($1/a_{kk}^{(k-1)}$) and n - k multiplications $(1/a_{kk}^{(k-1)})\,a_{ik}^{(k-1)}$.

‡ In this chapter, in contrast to Chap. 5, the notation $O(n^k)$ will always refer to the situation
as $n \to \infty$.

For large n, therefore, the Gauss-Jordan reduction requires about 50 percent
more operations than Gaussian elimination. For this reason, in the remainder
of this section we shall consider only Gaussian elimination.
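
The leading terms in (9.3-13) and (9.3-14) can be checked by simply tallying the operations stage by stage. The sketch below is an illustration under stated assumptions: the function names are hypothetical, one division per multiplier is counted, and the Gauss-Jordan count assumes that every row other than the pivot row is reduced at each of n stages.

```python
def count_gauss(n):
    """Multiplications and divisions in the elimination phase, per (9.3-7)."""
    ops = 0
    for k in range(1, n):                 # stages k = 1, ..., n-1
        rows = n - k                      # rows i = k+1, ..., n
        ops += rows                       # one division per multiplier m_ik
        ops += rows * (n - k + 1)         # columns j = k+1, ..., n+1
    return ops

def count_gauss_jordan(n):
    """Same tally when every row i != k is reduced at stage k."""
    ops = 0
    for k in range(1, n + 1):             # assumed: stages k = 1, ..., n
        rows = n - 1                      # all rows except row k
        ops += rows
        ops += rows * (n - k + 1)
    return ops

for n in (10, 100, 1000):
    g, gj = count_gauss(n), count_gauss_jordan(n)
    print(n, g / n**3, gj / n**3, gj / g)
```

As n grows, the two normalized counts approach 1/3 and 1/2, and their ratio approaches 1.5, in line with the 50 percent figure quoted above.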


9.3-2 Compact Forms of Gaussian Elimination 


During the days of hand computation, the classical Gaussian elimination 
method involved many recordings of intermediate results. This was tiresome 
in itself and was also a source of copying errors. To eliminate the need for 
these intermediate recordings, compact forms of Gaussian elimination were 
developed. Although the recording of intermediate results on a digital com- 
puter is neither tiresome nor error-prone, compact forms have still been 
found useful in digital computation. The reason is that they reduce roundoff 
error by accumulating inner products and performing other arithmetic oper- 
ations in double precision. The reader may ask: Why not perform the entire 
computation in double precision to cut down on roundoff errors? The 
answer is that this would involve doubling the storage requirements of the 
problem whereas doing selected portions of the computation in double 
precision requires only minimal storage increase. The increase in computa- 
tion time is about the same whether all or just some of the computation, as 
discussed above, is done in double precision; this increase may be quite 
small on sophisticated computers. Thus, throughout this chapter, we shall 
point out situations where double-precision computation is possible at 
almost no extra cost in storage and computation time. One such situation is
the back substitution given by (9.3-8). The quantity in brackets can be
computed quite economically using double-precision arithmetic and, after
division by $a_{ii}^{(i-1)}$, rounded to a single-precision number and stored as such.
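
The benefit of accumulating an inner product in double precision and rounding once at the end can be seen in a small experiment. This is a sketch under assumptions: single precision is simulated by round-tripping Python floats (which are double precision) through the `struct` module, and the helper names and data are illustrative.

```python
import struct

def to_single(x):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def dot_single(a, b):
    """Inner product with every product and partial sum rounded to single precision."""
    s = 0.0
    for ai, bi in zip(a, b):
        s = to_single(s + to_single(ai * bi))
    return s

def dot_accumulated(a, b):
    """Inner product accumulated in double precision, rounded to single once at the end."""
    s = 0.0
    for ai, bi in zip(a, b):
        s += ai * bi
    return to_single(s)

# All entries are exactly representable in single precision.
a = [1.0] + [2.0 ** -12] * 8
b = [1.0] + [2.0 ** -12] * 8
print(dot_single(a, b), dot_accumulated(a, b))
```

In the all-single version each small product is lost when it is added to the running sum 1, whereas the accumulated version retains their total. The stored result still occupies a single-precision word, which is the point made in the text.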


Matrix formulation Before deriving a compact form, we shall show how to
express the Gaussian elimination process in matrix form. To this end, we
introduce two types of special matrices which are modifications of the identity
matrix. The first type is denoted by $P_{ij}$ and has the form

$$P_{ij} = \begin{bmatrix}
1 & & & & & & \\
& \ddots & & & & & \\
& & 0 & \cdots & 1 & & \\
& & \vdots & \ddots & \vdots & & \\
& & 1 & \cdots & 0 & & \\
& & & & & \ddots & \\
& & & & & & 1
\end{bmatrix} \tag{9.3-15}$$

that is, the identity matrix with rows i and j interchanged.
Premultiplying a matrix A by $P_{ij}$ interchanges rows i and j, while
postmultiplication by $P_{ij}$ interchanges the corresponding columns. Clearly $P_{ij}$ is
symmetric and $P_{ij}^2 = I$, so that $P_{ij}^{-1} = P_{ij}$. By convention, $P_{ii} = I$ {5}.

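The second class of matrices we require are unit lower triangular
matrices $L_i$ of the form

$$L_i = \begin{bmatrix}
1 & & & & & \\
& \ddots & & & & \\
& & 1 & & & \\
& & -m_{i+1,i} & 1 & & \\
& & \vdots & & \ddots & \\
& & -m_{n,i} & & & 1
\end{bmatrix} \tag{9.3-16}$$

in which the nonzero subdiagonal elements lie in column i and
where the $m_{ki}$ are defined following (9.3-7). An interesting property of the $L_i$,
which can be verified by multiplication {6}, is that $L_i^{-1}$ has a form
identical to that of $L_i$ with the signs of the $m_{ki}$ reversed.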

If we now denote by $A_1$ the matrix of coefficients in (9.3-2), then

$$A_1 = L_1 P_{1,r_1} A \qquad r_1 \geq 1 \tag{9.3-17}$$

where we have assumed the interchange of rows 1 and $r_1$ before the elimination
step. (If no interchange was needed, $r_1 = 1$ and $P_{1,r_1} = I$.) In general, if
we denote the matrix of coefficients in the ith derived system by $A_i$, then

$$A_i = L_i P_{i,r_i} A_{i-1} \qquad i = 1, \ldots, n-1; \quad r_i \geq i \tag{9.3-18}$$

where $A_0 = A$ and where we interchange rows i and $r_i$ before the ith elimination
step. Finally, denoting the upper triangular matrix of coefficients $A_{n-1}$
in the final system (9.3-6) by $U = [u_{ij}]$, we have

$$U = L_{n-1} P_{n-1,r_{n-1}} L_{n-2} P_{n-2,r_{n-2}} L_{n-3} \cdots L_1 P_{1,r_1} A \tag{9.3-19}$$

If we had known the matrices $P_{i,r_i}$ in advance, we would have applied
them to A initially to yield the matrix

$$\tilde{A} = P_{n-1,r_{n-1}} P_{n-2,r_{n-2}} \cdots P_{1,r_1} A = PA \tag{9.3-20}$$

where P is a permutation matrix, $P = P_{n-1,r_{n-1}} \cdots P_{1,r_1}$, and we would have
been able to apply Gaussian elimination without interchanges. This implies
that we can rewrite (9.3-19) as

$$U = L_{n-1} \cdots L_1 (P_{n-1,r_{n-1}} \cdots P_{1,r_1}) A = \bar{L} \tilde{A} \tag{9.3-21}$$

where $\bar{L}$ is unit lower triangular since the product of two unit lower triangular
matrices is again unit lower triangular {7}. The inverse of $\bar{L}$ is
$L = L_1^{-1} \cdots L_{n-1}^{-1}$, which is also unit lower triangular and indeed of the form {7}

$$L = \begin{bmatrix}
1 & & & \\
m_{21} & 1 & & \\
\vdots & & \ddots & \\
m_{n1} & \cdots & m_{n,n-1} & 1
\end{bmatrix} \tag{9.3-22}$$

We thus have that

$$\tilde{A} = PA = LU \tag{9.3-23}$$


This shows that if A is a nonsingular matrix, there exists a permutation
matrix P such that $\tilde{A} = PA$ can be decomposed into the product of a unit lower
triangular matrix L and an upper triangular matrix U. This product is called
the triangular or LU decomposition of $\tilde{A}$. Furthermore, this decomposition
is unique. For assume also that $\tilde{A} = L_1 U_1$, where $L_1$ is unit lower triangular
and $U_1$ upper triangular, so that $LU = L_1 U_1$. Then since L and $U_1$ are
nonsingular, we have $U U_1^{-1} = L^{-1} L_1$. Since the inverse of a (unit) triangular
matrix is of the same form and the product of two (unit) triangular matrices
is also of the same form, we have that an upper triangular matrix is equal to
a unit lower triangular matrix. This is possible only if both matrices are
equal to the identity matrix, i.e., that $U_1 = U$ and $L_1 = L$, proving
uniqueness.

If we denote by $\bar{b}$ the right-hand side of (9.3-6), then we have similarly
that

$$\bar{b} = L^{-1} P b \tag{9.3-24}$$

so that

$$\tilde{b} = Pb = L\bar{b} \tag{9.3-25}$$

The details of the triangular decomposition are usually stored in the
computer as follows. Initially, we define a vector $v = (1, 2, \ldots, n)^T$. At stage j,
we interchange the indices $v_j$ and $v_{r_j}$. Then we compute the $m_{ij}$, i > j, and
store them in place of those elements of A which are set to 0, that is,
eliminated at that stage. At the end of stage n - 1, the initial matrix A has
been replaced by U in the upper triangle and by L, without the unit diagonal,
below the diagonal. From the final vector v, we can reconstruct the permutation
matrix P explicitly to compute $\tilde{b}$ or, as is more usually the case, we use v
directly to compute $\tilde{b}$.

From this matrix formulation, we see that it is not necessary to carry b 
along during the Gaussian elimination process. In fact, once we have found 
L, U, and P such that (9.3-23) holds, we can solve (9.1-2) for any right-hand 
side as follows: 


$$LUx = PAx = Pb \tag{9.3-26}$$

is equivalent to (9.1-2) since P is nonsingular. Setting y = Ux, we first solve
the equation

$$Ly = Pb = \tilde{b} \tag{9.3-27}$$

for y by the process of forward substitution using a formula similar to
(9.3-8),

$$y_i = \tilde{b}_i - \sum_{j=1}^{i-1} l_{ij} y_j \qquad i = 1, \ldots, n \tag{9.3-28}$$

We then solve the equation Ux = y by back substitution. As in the case of
(9.3-8), we can do the arithmetic in (9.3-28) in double precision at little extra
cost.

From the above, we see that we can divide the Gaussian elimination 
algorithm for solving (9.1-2) into two distinct processes. The first is the 
triangular decomposition of A (9.3-23), which is independent of b. The 
second is the combination of forward and back substitution applied to Pb to 
get the solution x. This means that once we have found the LU decomposi- 
tion of A, we can solve (9.1-2) for any right-hand side b. In particular, we 
need not know the value of b at the time we perform the LU decomposition. 
This is in contrast to the Gaussian elimination algorithm where we carry b 
along during the elimination and do not save the $m_{ij}$. The importance of this
feature will become clear later on in our development. 
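
The separation into a decomposition step and a substitution step might be sketched as follows. The names `lu_decompose` and `lu_solve` are illustrative assumptions; the pivot row is chosen here as the one with the largest element in the current column, which is one possible rule for selecting the interchanges; indices are 0-based; and exact rational arithmetic is used so that the results can be checked exactly.

```python
from fractions import Fraction

def lu_decompose(a):
    """Return (lu, v): U on and above the diagonal, multipliers m_ij below it,
    and the index vector v recording the row interchanges (0-based)."""
    n = len(a)
    lu = [row[:] for row in a]
    v = list(range(n))
    for j in range(n - 1):
        # Choose the row with the largest entry in column j.
        r = max(range(j, n), key=lambda i: abs(lu[i][j]))
        lu[j], lu[r] = lu[r], lu[j]
        v[j], v[r] = v[r], v[j]
        for i in range(j + 1, n):
            m = lu[i][j] / lu[j][j]
            lu[i][j] = m                      # store m_ij in the eliminated slot
            for k in range(j + 1, n):
                lu[i][k] -= m * lu[j][k]
    return lu, v

def lu_solve(lu, v, b):
    """Solve Ax = b from the stored factors: Ly = Pb, then Ux = y."""
    n = len(lu)
    y = [b[v[i]] for i in range(n)]           # Pb, read off through v
    for i in range(n):                        # forward substitution (9.3-28)
        y[i] -= sum(lu[i][j] * y[j] for j in range(i))
    x = [0] * n
    for i in range(n - 1, -1, -1):            # back substitution
        s = sum(lu[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - s) / lu[i][i]
    return x

A = [[Fraction(e) for e in row] for row in [[0, 2, 1], [1, 1, 1], [2, 1, -1]]]
lu, v = lu_decompose(A)
x1 = lu_solve(lu, v, [Fraction(7), Fraction(6), Fraction(1)])
x2 = lu_solve(lu, v, [Fraction(3), Fraction(3), Fraction(2)])
print(v, x1, x2)
```

The factorization is computed once and then reused for two right-hand sides, neither of which had to be known when `lu_decompose` was called.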


9.3-3 The Doolittle, Crout, and Cholesky Algorithms 


We shall initially assume that we are working with a matrix A in which no 
interchanges are necessary; subsequently we shall discuss how to introduce 
interchanges into the algorithms. The Doolittle algorithm is essentially 
another way to arrive at an LU decomposition of A while allowing for 
double-precision calculation of inner products. Since we assume that A has 
an LU decomposition, let us try to find L and U by equating corresponding 
elements of A and LU. We thus have 


$$a_{ij} = \sum_{k=1}^{n} l_{ik} u_{kj} = \sum_{k=1}^{\min(i,j)} l_{ik} u_{kj} \qquad i, j = 1, \ldots, n \tag{9.3-29}$$

There are $n^2$ equations in $n^2$ unknowns, the n(n + 1)/2 elements of U and the
n(n - 1)/2 subdiagonal elements of L. These equations can be solved
recursively by using the properties of triangular matrices in several ways. In view
of the fact that we wish to allow the possibility of row interchanges, we shall
organize the computation as follows, computing in succession one row of U
followed by the corresponding column of L.

Setting i = 1 in (9.3-29), we immediately have that

$$u_{1j} = a_{1j} \qquad j = 1, \ldots, n \tag{9.3-30}$$

since $l_{11} = 1$. Then setting j = 1 and i > 1 in (9.3-29), we have

$$l_{i1} = \frac{a_{i1}}{u_{11}} \qquad i = 2, \ldots, n \tag{9.3-31}$$

Setting now i = 2 and $j \geq i$ in (9.3-29), we find

$$u_{2j} = a_{2j} - l_{21} u_{1j} \qquad j = 2, \ldots, n \tag{9.3-32}$$

Then, setting j = 2 and i > j, we have

$$l_{i2} = \frac{a_{i2} - l_{i1} u_{12}}{u_{22}} \qquad i = 3, \ldots, n \tag{9.3-33}$$

In general, we compute row r, r = 1, ..., n, of U by

$$u_{rj} = a_{rj} - \sum_{k=1}^{r-1} l_{rk} u_{kj} \qquad j = r, \ldots, n \tag{9.3-34}$$

followed by column r, r = 1, ..., n - 1, of L, using

$$l_{ir} = \frac{a_{ir} - \sum_{k=1}^{r-1} l_{ik} u_{kr}}{u_{rr}} \qquad i = r+1, \ldots, n \tag{9.3-35}$$

At each stage, all values entering into (9.3-34) and (9.3-35) have been
computed at previous stages. Our assumption that no interchanges are
necessary ensures that all $u_{rr} \neq 0$, so that the algorithm is fully defined. As is
evident from the equations, we can compute the right-hand sides of (9.3-34)
and (9.3-35) in double precision, rounding only at the end to yield
single-precision values $u_{rj}$ and $l_{ir}$. It is worth noting that if we compute (9.3-34) and
(9.3-35) using single-precision arithmetic, the results are identical with the
values of L and U arising in the LU decomposition based on Gaussian
elimination {9}.
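
A direct transcription of (9.3-34) and (9.3-35) might look like the sketch below. The function name `doolittle` and the test matrix are illustrative assumptions; no interchanges are performed, so every $u_{rr}$ is assumed nonzero, and exact rational arithmetic stands in for the double-precision accumulation discussed in the text.

```python
from fractions import Fraction

def doolittle(a):
    """Factor A = LU with L unit lower triangular, assuming no interchanges
    are needed (every u_rr is nonzero)."""
    n = len(a)
    L = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    U = [[Fraction(0)] * n for _ in range(n)]
    for r in range(n):
        for j in range(r, n):                 # row r of U, (9.3-34)
            U[r][j] = a[r][j] - sum(L[r][k] * U[k][j] for k in range(r))
        for i in range(r + 1, n):             # column r of L, (9.3-35)
            L[i][r] = (a[i][r] - sum(L[i][k] * U[k][r] for k in range(r))) / U[r][r]
    return L, U

A = [[Fraction(e) for e in row] for row in [[2, 1, 1], [4, 3, 3], [8, 7, 9]]]
L, U = doolittle(A)
print(L, U)
```

Each inner sum is a complete inner product formed before anything is stored, which is where extended-precision accumulation would be applied on a machine with single- and double-precision arithmetic.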

The Crout algorithm differs from the Doolittle one in that it generates a
unit upper triangular matrix $\hat{U}$ and a general lower triangular matrix $\hat{L}$ such
that

$$\hat{L}\hat{U} = A \tag{9.3-36}$$

Following the same method as before, we first generate column r of $\hat{L}$ using
the equations

$$\hat{l}_{ir} = a_{ir} - \sum_{k=1}^{r-1} \hat{l}_{ik} \hat{u}_{kr} \qquad i = r, \ldots, n \tag{9.3-37}$$

followed by row r of $\hat{U}$, given by

$$\hat{u}_{rj} = \frac{a_{rj} - \sum_{k=1}^{r-1} \hat{l}_{rk} \hat{u}_{kj}}{\hat{l}_{rr}} \qquad j = r+1, \ldots, n \tag{9.3-38}$$

The Crout algorithm has a slight advantage over the Doolittle algorithm
when we wish to do row interchanges, as we shall see in the next section.

The two algorithms are closely related. In fact, the diagonal of U is
identical with that of $\hat{L}$ {10}. If we write $D = \operatorname{diag}(u_{ii})$, then

$$A = LU = (LD)(D^{-1}U) = \hat{L}\hat{U} = LD\hat{U} \tag{9.3-39}$$

Thus, both the Doolittle and Crout factorizations are special cases of the
LDU factorization of A into a unit lower triangular matrix, a diagonal
matrix, and a unit upper triangular matrix.
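
The Crout recurrences can be sketched in the same way. The name `crout` and the test matrix are assumptions made for illustration, and no interchanges are performed. For this matrix the Doolittle factors are L with rows (1, 0, 0), (2, 1, 0), (4, 3, 1) and U with diagonal (2, 1, 2), so the relation between the two factorizations can be read off from the output.

```python
from fractions import Fraction

def crout(a):
    """Factor A = L U with U unit upper triangular, assuming no interchanges."""
    n = len(a)
    L = [[Fraction(0)] * n for _ in range(n)]
    U = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for r in range(n):
        for i in range(r, n):                 # column r of L, (9.3-37)
            L[i][r] = a[i][r] - sum(L[i][k] * U[k][r] for k in range(r))
        for j in range(r + 1, n):             # row r of U, (9.3-38)
            U[r][j] = (a[r][j] - sum(L[r][k] * U[k][j] for k in range(r))) / L[r][r]
    return L, U

A = [[Fraction(e) for e in row] for row in [[2, 1, 1], [4, 3, 3], [8, 7, 9]]]
Lc, Uc = crout(A)
print(Lc, Uc)
```

The diagonal of the Crout lower factor is (2, 1, 2), the same as the diagonal of the Doolittle U, and each column of the Crout lower factor is the corresponding Doolittle column scaled by that diagonal element.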

We shall return to the question of row interchanges after we discuss the 
subject of pivoting in the next section. First, however, we shall give an 
algorithm for the triangular decomposition of a positive definite symmetric 
matrix A. Such a matrix can be decomposed without row interchanges in the 
form 


$$A = LL^T \tag{9.3-40}$$

where L is a real nonsingular lower triangular matrix. The proof of this fact
is instructive in that it gives an algorithm for decomposing matrices formed
by adjoining a row and a column to a matrix $A_{n-1}$ of similar form using
the result of the decomposition of $A_{n-1}$. This feature is useful in solving
least-square problems using normal equations when the desired order
of the approximation is not known in advance [see Sec. 6.3-2].

Assume now that we have a decomposition of $A_{n-1} = L_{n-1} L_{n-1}^T$, where

$$A_n = \begin{bmatrix} A_{n-1} & b_{n-1} \\ b_{n-1}^T & a_{nn} \end{bmatrix} \tag{9.3-41}$$

We can assume this induction hypothesis since $A_{n-1}$ is also positive definite.
This follows from the fact that if a matrix is positive definite, all its principal
minors are positive definite {14}. Now define $c_{n-1} = L_{n-1}^{-1} b_{n-1}$,
$x = (a_{nn} - c_{n-1}^T c_{n-1})^{1/2}$ and

$$L_n = \begin{bmatrix} L_{n-1} & 0 \\ c_{n-1}^T & x \end{bmatrix} \tag{9.3-42}$$

Then it is easy to see that {11}

$$L_n L_n^T = A_n \tag{9.3-43}$$

The only thing remaining is to show that x is real. To this end, we take
determinants of both sides of (9.3-43) and use (9.3-42) to get

$$[\det(L_{n-1})]^2 x^2 = \det(A_n) \tag{9.3-44}$$

Since $\det(A_n) > 0$ because $A_n$ is positive definite, it follows that x is real.

The elements of the rth row of L can be computed directly from (9.3-40)
using the equations

$$\sum_{j=1}^{i-1} l_{rj} l_{ij} + l_{ri} l_{ii} = a_{ri} \qquad i = 1, \ldots, r-1 \tag{9.3-45}$$

$$\sum_{j=1}^{r-1} l_{rj}^2 + l_{rr}^2 = a_{rr} \tag{9.3-46}$$

so that this decomposition, called the Cholesky factorization of A, requires n
square roots and about $n^3/6$ multiplications. From (9.3-46) we see that

$$l_{rj}^2 \leq a_{rr} \qquad j = 1, \ldots, r$$

which implies that all elements of L are bounded by $\max_i a_{ii}^{1/2}$.


Example 9.1 Apply the Cholesky algorithm to the positive definite symmetric matrix A
below using 5-digit floating-point decimal arithmetic.

$$A = \begin{bmatrix}
136.01 & 90.860 & 0 & 0 \\
90.860 & 98.810 & -67.590 & 0 \\
0 & -67.590 & 132.01 & 46.260 \\
0 & 0 & 46.260 & 177.17
\end{bmatrix}$$

For r = 1:
$l_{11}^2 = a_{11}$, so $l_{11} = 11.662$

For r = 2:
$l_{21} l_{11} = a_{21}$, so $l_{21} = 7.7911$
$l_{21}^2 + l_{22}^2 = a_{22}$, so $l_{22} = 6.1733$

For r = 3:
$l_{31} l_{11} = a_{31}$, so $l_{31} = 0$
$l_{31} l_{21} + l_{32} l_{22} = a_{32}$, so $l_{32} = -10.949$
$l_{31}^2 + l_{32}^2 + l_{33}^2 = a_{33}$, so $l_{33} = 3.4827$

For r = 4:
$l_{41} l_{11} = a_{41}$, so $l_{41} = 0$
$l_{41} l_{21} + l_{42} l_{22} = a_{42}$, so $l_{42} = 0$
$l_{41} l_{31} + l_{42} l_{32} + l_{43} l_{33} = a_{43}$, so $l_{43} = 13.283$
$l_{41}^2 + l_{42}^2 + l_{43}^2 + l_{44}^2 = a_{44}$, so $l_{44} = .85552$

Thus

$$L = \begin{bmatrix}
11.662 & 0 & 0 & 0 \\
7.7911 & 6.1733 & 0 & 0 \\
0 & -10.949 & 3.4827 & 0 \\
0 & 0 & 13.283 & .85552
\end{bmatrix}$$

As a check we compute $LL^T$ and find that

$$LL^T = \begin{bmatrix}
136.00 & 90.860 & 0 & 0 \\
90.860 & 98.811 & -67.591 & 0 \\
0 & -67.591 & 132.01 & 46.261 \\
0 & 0 & 46.261 & 177.17
\end{bmatrix}$$

which agrees with A to within rounding error.
Although A is tridiagonal, we did not use this fact in setting up the equations for the
elements of L.
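
The row-by-row computation in (9.3-45) and (9.3-46) might be coded as in the sketch below, applied here to the matrix of Example 9.1. The function name `cholesky` is an assumption, and the arithmetic is ordinary double precision rather than the 5-digit decimal arithmetic of the example, so the computed entries need not agree with the tabulated ones in every digit. The last diagonal element, which comes from a difference of nearly equal numbers, is the most sensitive.

```python
import math

def cholesky(a):
    """Return lower triangular L with L L^T = A for a symmetric positive
    definite A, computed row by row from (9.3-45) and (9.3-46)."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for i in range(r):                    # off-diagonal elements of row r
            s = sum(L[r][j] * L[i][j] for j in range(i))
            L[r][i] = (a[r][i] - s) / L[i][i]
        d = a[r][r] - sum(L[r][j] ** 2 for j in range(r))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[r][r] = math.sqrt(d)
    return L

A = [[136.01, 90.860, 0.0, 0.0],
     [90.860, 98.810, -67.590, 0.0],
     [0.0, -67.590, 132.01, 46.260],
     [0.0, 0.0, 46.260, 177.17]]
L = cholesky(A)
# Reconstruct L L^T to check the factorization.
LLT = [[sum(L[i][k] * L[j][k] for k in range(4)) for j in range(4)] for i in range(4)]
for row in L:
    print(row)
```

Forming `LLT` and comparing it with `A` is the same check as the one carried out at the end of the example.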


If we actually need the factorization of A in the form LL’, then we must 
compute the square roots implied by (9.3-46). However, if we are only 
solving a system of equations, we can save the computation of the square 
roots as follows. 

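Let $D = \operatorname{diag}(l_{ii})$; then

$$A = (LD^{-1}) D^2 (D^{-1}L^T) = \tilde{L}\tilde{D}\tilde{L}^T \tag{9.3-47}$$

where $\tilde{L}$ is unit lower triangular and $\tilde{D} = D^2$ is a positive diagonal matrix. The
rth row of $\tilde{L}$ and rth element $d_r$ of $\tilde{D}$ are given by the equations {12}

$$g_{ri} = a_{ri} - \sum_{j=1}^{i-1} g_{rj} \tilde{l}_{ij}, \qquad \tilde{l}_{ri} = \frac{g_{ri}}{d_i} \qquad i = 1, \ldots, r-1 \tag{9.3-48}$$

$$d_r = a_{rr} - \sum_{j=1}^{r-1} g_{rj} \tilde{l}_{rj} \tag{9.3-49}$$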


Note that in both of the decompositions, we can compute in double 
precision. 
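
A square-root-free factorization of this kind might be sketched as follows. The arrangement with an auxiliary row g, where $g_{rj}$ stands for the product of the (r, j) element of the unit lower triangular factor and $d_j$, is a standard one and is assumed here; the function name `ldlt` and the test matrix are likewise illustrative.

```python
from fractions import Fraction

def ldlt(a):
    """Factor symmetric A as L D L^T with L unit lower triangular, no square roots."""
    n = len(a)
    L = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    d = [Fraction(0)] * n
    for r in range(n):
        g = [Fraction(0)] * r                 # auxiliary g_rj = l_rj * d_j
        for i in range(r):
            g[i] = a[r][i] - sum(g[j] * L[i][j] for j in range(i))
            L[r][i] = g[i] / d[i]
        d[r] = a[r][r] - sum(g[j] * L[r][j] for j in range(r))
    return L, d

A = [[Fraction(e) for e in row] for row in [[4, 2, 2], [2, 5, 3], [2, 3, 6]]]
L, d = ldlt(A)
print(L, d)
```

Only divisions by the previously computed $d_i$ appear, so the n square roots of the Cholesky form are avoided.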


9.3-4 Pivoting and Equilibration 


Our discussion of Gaussian elimination makes it clear that when the element
$a_{ii}^{(i-1)} = 0$, we must interchange two rows in order to be able to
continue the process. Now it is a general rule in numerical analysis that
when a certain process cannot be carried out in a particular situation, it
probably cannot be carried out successfully in a neighboring situation. Thus,
if $a_{ii}^{(i-1)}$ is not exactly 0 but is close to 0, we can expect that the elimination
process may suffer from numerical instability. Therefore to minimize the
possibility of such instability, we seek at each stage the number farthest from
0; that is, we find the row $r_i \geq i$ for which $|a_{ri}^{(i-1)}|$ is maximal, r = i, ..., n,
and interchange it with row i. Thus, the current value of $a_{ii}^{(i-1)}$ will be the
furthest from 0. The element $a_{ii}^{(i-1)}$ is called the pivot element and the row $r_i$,
the pivot row. The process of so choosing $a_{ii}^{(i-1)}$ is called partial pivoting, in
contrast to complete pivoting, where we interchange both rows and columns
to find the row $r_i$ and column $c_i$ for which $|a_{rc}^{(i-1)}|$ is maximal, r = i, ..., n,
c = i, ..., n. While this latter process has some theoretical advantage over
partial pivoting, it is almost never used in practice since it is more
time-consuming and does not give better results in realistic situations.


Example 9.2 Solve the following system of linear equations by Gaussian elimination using
10-digit decimal floating-point arithmetic.

$$\begin{aligned}
3 \times 10^{-11} x_1 + x_2 &= .7 \\
x_1 + x_2 &= .9
\end{aligned} \tag{9.3-50}$$

Without pivoting, we compute first $m_{21} = .33333\,33333 \times 10^{11}$ and end up with the
triangular system

$$\begin{aligned}
3 \times 10^{-11} x_1 + x_2 &= .7 \\
-.33333\,33333 \times 10^{11} x_2 &= -.23333\,33333 \times 10^{11}
\end{aligned} \tag{9.3-51}$$

which yields by back substitution the values

$$x_2 = .70000\,00000 \qquad x_1 = .00000\,00000$$

in contrast to the true solution correct to 10 digits,

$$x_1 = .20000\,00000 \qquad x_2 = .70000\,00000$$

With pivoting, we interchange rows 1 and 2 and then find that $m_{21} = .3 \times 10^{-10}$, giving
the triangular system

$$\begin{aligned}
x_1 + x_2 &= .9 \\
x_2 &= .7
\end{aligned} \tag{9.3-52}$$

which gives the correct solution after back substitution.
Note that the error in the first case is not due to any accumulation of roundoff error
but only to a single large rounding error which arises from the large value of $m_{21}$. This
results in $a_{22}^{(1)}$ being essentially independent of the value of $a_{22}$, which can range over a
large set of values without changing (9.3-51).
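
The two computations of Example 9.2 can be reproduced with Python's `decimal` module, which rounds every arithmetic operation to a chosen number of significant digits. This is only a sketch: the helper `solve_2x2` and its `pivot` flag are illustrative assumptions.

```python
from decimal import Decimal, getcontext

getcontext().prec = 10        # every operation is rounded to 10 significant digits

def solve_2x2(rows, pivot):
    """Gaussian elimination on a 2 x 3 augmented system in 10-digit arithmetic."""
    (a11, a12, b1), (a21, a22, b2) = rows
    if pivot and abs(a21) > abs(a11):         # partial pivoting: swap the rows
        (a11, a12, b1), (a21, a22, b2) = (a21, a22, b2), (a11, a12, b1)
    m21 = a21 / a11
    a22 = a22 - m21 * a12
    b2 = b2 - m21 * b1
    x2 = b2 / a22                             # back substitution
    x1 = (b1 - a12 * x2) / a11
    return x1, x2

system = [(Decimal("3E-11"), Decimal(1), Decimal("0.7")),
          (Decimal(1), Decimal(1), Decimal("0.9"))]
no_pivot = solve_2x2(system, pivot=False)
with_pivot = solve_2x2(system, pivot=True)
print(no_pivot, with_pivot)
```

Without the interchange the computed $x_1$ is 0, while with it the solution agrees with the true values .2 and .7 to 10 digits.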


Now, let us multiply the first row of (9.3-50) by $10^{11}$ and solve the
resulting system using partial pivoting. Since this time the coefficient of $x_1$ in
the first equation is larger than that in the second, we do not interchange
rows and we find that our results are identical with the results from the
solution of (9.3-50) without pivoting. For pivoting to be effective, all the
rows of A should be of the same order of magnitude, so that in searching for
the pivot element we are comparing numbers which originated in vectors
of about the same size. The simplest way to accomplish this is to require
that the $L_\infty$ norm of each row of A be unity. This is called equilibration by
rows. It is not always satisfactory, but it generally works {19}. In practice
we can do either an explicit equilibration by dividing each element in a row
by the element of maximum magnitude in that row or an implicit equilibration,
as below. The main disadvantage of explicit equilibration is that it
introduces an additional rounding error in the elements of A.

We now show how to implement the Doolittle and Crout algorithms
with partial pivoting and implicit equilibration. Initially we compute the
values

$$d_i = \max_{1 \leq j \leq n} |a_{ij}| \qquad i = 1, \ldots, n \tag{9.3-53}$$

For the Doolittle algorithm assume that we have reached the rth stage and
have computed $u_{ij}$, i = 1, ..., r - 1, $j \geq i$, and $l_{ij}$, j = 1, ..., r - 1, i > j, and
that these have overwritten the corresponding $a_{ij}$. We now compute quantities
$s_i$ in double precision defined by

$$s_i = a_{ir} - \sum_{k=1}^{r-1} l_{ik} u_{kr} \qquad i = r, \ldots, n \tag{9.3-54}$$

and store $s_i$ at the end of row i. The $s_i$'s are the quantities which would have
resulted in rows r to n of column r after the first r - 1 stages of Gaussian
elimination {9}. Now let p be such that

$$\left| \frac{s_p}{d_p} \right| = \max_{r \leq i \leq n} \left| \frac{s_i}{d_i} \right| \tag{9.3-55}$$

Then this p is the index of that row from rows r to n which would have had
the element of largest magnitude in column r after r - 1 elimination stages if
the original matrix had been explicitly equilibrated. The calculation of p by
(9.3-55) therefore represents implicit equilibration and determination of the
pivot element. Using this p, we interchange rows r and p of the complete
array augmented by the $s_i$ and also interchange the elements $v_r$ and $v_p$ of the
vector v.† If we still refer to the new element in the (i, j) position as $l_{ij}$ or $a_{ij}$,
whichever is relevant, we have [cf. (9.3-34) and (9.3-35)]

$$\begin{aligned}
u_{rr} &= s_r, \qquad u_{rj} = a_{rj} - \sum_{k=1}^{r-1} l_{rk} u_{kj} \qquad j = r+1, \ldots, n \\
l_{ir} &= \frac{s_i}{u_{rr}} \qquad i = r+1, \ldots, n
\end{aligned} \tag{9.3-56}$$

Note that the Doolittle method with partial pivoting requires the storage of a
double-precision vector s.

In the Crout algorithm, on the other hand, we proceed as follows at the
rth step. Compute $\hat{l}_{ir}$ by the formula [cf. (9.3-37)]

$$\hat{l}_{ir} = a_{ir} - \sum_{k=1}^{r-1} \hat{l}_{ik} \hat{u}_{kr} \qquad i = r, \ldots, n \tag{9.3-57}$$

and overwrite $a_{ir}$ by $\hat{l}_{ir}$. The $\hat{l}_{ir}$ here are also computed in double precision, as
the $s_i$ were in the Doolittle algorithm. However, since we do not do any
further computation with these values, we store them as properly rounded
single-precision numbers in place of the $a_{ir}$. If

$$\left| \frac{\hat{l}_{pr}}{d_p} \right| = \max_{r \leq i \leq n} \left| \frac{\hat{l}_{ir}}{d_i} \right| \tag{9.3-58}$$

interchange rows r and p of the complete array and also $v_r$ and $v_p$, as before.
Then compute $\hat{u}_{rj}$ by means of [cf. (9.3-38)]

$$\hat{u}_{rj} = \frac{a_{rj} - \sum_{k=1}^{r-1} \hat{l}_{rk} \hat{u}_{kj}}{\hat{l}_{rr}} \qquad j = r+1, \ldots, n \tag{9.3-59}$$

As in the Doolittle method, (9.3-58) determines the pivot element using
implicit equilibration.

† We do not actually have to interchange the two rows explicitly. It can be done implicitly
by replacing the row index, say r, throughout by the corresponding index $v_r$.

With partial pivoting and no equilibration or explicit equilibration we
have that in the Doolittle method $|l_{ij}| \leq 1$ and in the Crout method
$|\hat{l}_{ij}| \leq |\hat{l}_{jj}|$. However, this is not so when implicit equilibration is used. Then we
only have that $|l_{ij}/d_i| \leq 1/d_j$ and $|\hat{l}_{ij}/d_i| \leq |\hat{l}_{jj}/d_j|$, respectively, as follows
from (9.3-55) and (9.3-58).

Since the Crout algorithm is a little more economical in storage, it
should be used, since its stability properties are essentially the same as those
of the Doolittle algorithm. We could modify the Doolittle algorithm so that
it required the same amount of storage as the Crout algorithm by rounding
the $s_i$ to single precision and overwriting $a_{ir}$ by $s_i$, i = r, ..., n. However, in
this case, we would lose accuracy, so that again the Crout algorithm is
preferable. Of course, in using the Crout algorithm, Eqs. (9.3-28) and (9.3-8)
for forward and back substitution must be modified accordingly {20}.


Example 9.3 Apply the Doolittle algorithm with partial pivoting and equilibration to the
matrix A below using 5-digit floating-point decimal arithmetic.

$$A = \begin{pmatrix} 136.01 & 90.860 & 0 & 0 \\ 90.860 & 98.810 & -67.590 & 0 \\ 0 & -67.590 & 132.01 & 46.260 \\ 0 & 0 & 46.260 & 177.17 \end{pmatrix}$$

We first initialize the vector v to be $(1, 2, 3, 4)^T$. The values $d_i$, i = 1, ..., 4, are

$$d_1 = 136.01 \qquad d_2 = 98.810 \qquad d_3 = 132.01 \qquad d_4 = 177.17$$

For r = 1, we have that $s_i = a_{i1}$, i = 1, ..., 4.
Since $|s_1/d_1| = \max_{1 \le i \le 4} |s_i/d_i|$, we do not interchange rows. We now set

$$u_{11} = s_1 \qquad u_{1j} = a_{1j} \quad j = 2, \ldots, 4 \qquad l_{i1} = \frac{s_i}{u_{11}} \quad i = 2, \ldots, 4$$

The new matrix has the form

$$\begin{pmatrix} 136.01 & 90.860 & 0 & 0 \\ .66804 & 98.810 & -67.590 & 0 \\ 0 & -67.590 & 132.01 & 46.260 \\ 0 & 0 & 46.260 & 177.17 \end{pmatrix}$$

and $v = (1, 2, 3, 4)^T$ is unchanged. For r = 2

$$s_i = a_{i2} - l_{i1} u_{12} \qquad i = 2, \ldots, 4$$
$$s_2 = 38.11188560 \qquad s_3 = -67.590 \qquad s_4 = 0$$

Since $|s_3/d_3| = \max_{2 \le i \le 4} |s_i/d_i|$, we interchange rows 2 and 3 in the matrix as well as $s_2$
and $s_3$. After this interchange

$$u_{22} = s_2 = -67.590 \qquad u_{2j} = a_{2j} - l_{21} u_{1j} \quad j = 3, 4 \qquad l_{i2} = \frac{s_i}{u_{22}} \quad i = 3, 4$$

The new matrix is

$$\begin{pmatrix} 136.01 & 90.860 & 0 & 0 \\ 0 & -67.590 & 132.01 & 46.260 \\ .66804 & -.56387 & -67.590 & 0 \\ 0 & 0 & 46.260 & 177.17 \end{pmatrix}$$

and v is now $(1, 3, 2, 4)^T$.
For r = 3,

$$s_i = a_{i3} - l_{i1} u_{13} - l_{i2} u_{23} \qquad i = 3, 4$$
$$s_3 = 6.846478700 \qquad s_4 = 46.260$$

Since $|s_4/d_4| > |s_3/d_3|$, we interchange rows 3 and 4 in the matrix as well as $s_3$ and $s_4$.
We then have

$$u_{33} = s_3 = 46.260 \qquad u_{34} = a_{34} - l_{31} u_{14} - l_{32} u_{24} \qquad l_{43} = \frac{s_4}{u_{33}}$$

The new matrix is

$$\begin{pmatrix} 136.01 & 90.860 & 0 & 0 \\ 0 & -67.590 & 132.01 & 46.260 \\ 0 & 0 & 46.260 & 177.17 \\ .66804 & -.56387 & .14800 & 0 \end{pmatrix}$$

and $v = (1, 3, 4, 2)^T$.
Finally, for r = 4, $s_4 = a_{44} - l_{41} u_{14} - l_{42} u_{24} - l_{43} u_{34} = -.1365338$, and

$$u_{44} = -.13653$$

We now have the factorization PA = LU, where

$$L = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ .66804 & -.56387 & .14800 & 1 \end{pmatrix} \qquad U = \begin{pmatrix} 136.01 & 90.860 & 0 & 0 \\ 0 & -67.590 & 132.01 & 46.260 \\ 0 & 0 & 46.260 & 177.17 \\ 0 & 0 & 0 & -.13653 \end{pmatrix}$$

and

$$P = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \end{pmatrix}$$

Note that although A is symmetric and tridiagonal, we have not used these properties of A
in this decomposition.
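
The steps of Example 9.3 can be retraced with a short program. The following is only a sketch, not the authors' program: the function name `doolittle_scaled_pivoting` is hypothetical, indices are zero-based, the pivot is chosen by the largest $|s_i/d_i|$ as in the example, and ordinary double-precision arithmetic is assumed in place of 5-digit decimal arithmetic, so the trailing digits differ slightly from those printed above. No test for a zero pivot is made.

```python
def doolittle_scaled_pivoting(a):
    """Doolittle LU with partial pivoting and implicit row equilibration.

    Returns (lu, v): lu holds U on and above the diagonal and the
    multipliers of the unit lower triangular L below it; v is the
    (zero-based) row permutation vector.
    """
    n = len(a)
    lu = [row[:] for row in a]                    # work on a copy
    d = [max(abs(x) for x in row) for row in lu]  # row scale factors d_i
    v = list(range(n))
    for r in range(n):
        # s_i = a_ir - sum_{k<r} l_ik u_kr for i = r, ..., n-1
        s = [lu[i][r] - sum(lu[i][k] * lu[k][r] for k in range(r))
             for i in range(r, n)]
        # pivot row p maximizes |s_i / d_i| (first maximum wins on ties)
        p = r + max(range(n - r), key=lambda t: abs(s[t]) / d[r + t])
        if p != r:
            lu[r], lu[p] = lu[p], lu[r]
            d[r], d[p] = d[p], d[r]
            v[r], v[p] = v[p], v[r]
            s[0], s[p - r] = s[p - r], s[0]
        # row r of U
        for j in range(r + 1, n):
            lu[r][j] -= sum(lu[r][k] * lu[k][j] for k in range(r))
        lu[r][r] = s[0]
        # column r of L
        for i in range(r + 1, n):
            lu[i][r] = s[i - r] / lu[r][r]
    return lu, v


A = [[136.01, 90.860, 0.0, 0.0],
     [90.860, 98.810, -67.590, 0.0],
     [0.0, -67.590, 132.01, 46.260],
     [0.0, 0.0, 46.260, 177.17]]
lu, v = doolittle_scaled_pivoting(A)
print([i + 1 for i in v])   # permutation vector in the one-based form of the text
for row in lu:
    print(row)
```

The same row interchanges occur as in the example, and the computed factors agree with the 5-digit values to roughly four significant figures.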


9.4 ERROR ANALYSIS 


Were we to carry out the Gaussian elimination algorithm in any of its
equivalent forms using exact arithmetic operating on a true matrix A and a
true right-hand side b, we would obtain a true answer $x_t = A^{-1}b$. Indeed,
there now exist programs which, when the elements of A and b are rational
numbers, compute an exact rational solution of (9.1-2) in a reasonable
amount of time for moderate values of n, say up to 100 or so. However, since
most matrices arising in practice are either based on experimental data or
result from a lengthy computation, the values of the elements of these
matrices are not exact. Thus, it makes no sense to invest the considerable
effort needed to compute an exact solution to an inexact problem, inasmuch
as the time required for an exact solution is greater by several orders of
magnitude than that required for an approximate solution using Gaussian
elimination and floating-point arithmetic.

In the course of this section we shall use backward error analysis [see
Sec. 1.6-1] to show that the computed solution $x_c$ of the problem (9.1-2)
is the exact solution of a neighboring problem

$$(A + E)x = b$$    (9.4-1)

where E is small and we can determine bounds on the elements of E. Note
that the right-hand side of (9.4-1) is the same as that of (9.1-2). This does not
mean that b has no effect on the matrix E. Indeed, E depends on b. However,
as we shall see in Sec. 9.4-1, we can find bounds on the elements of E
independent of b. Hence, it is important to know how this perturbation of A
affects the solution. This problem is of interest in itself aside from
considerations of roundoff error. For if A arises from experimental data, the elements
of A are determined up to experimental error, so that the value of the (i, j)
element of the true matrix may be anywhere in the interval $[a_{ij} - \epsilon_{ij}, a_{ij} + \epsilon_{ij}]$,
where $\epsilon_{ij} > 0$ is the experimental error in the accepted value $a_{ij}$. Hence it is
important to see how the exact solution of (9.1-2) is affected by changes in
the elements of A, and similarly in those of b, even though in this latter case,
the results do not affect the bounds obtained by roundoff error analysis.

We, therefore, consider first exact solutions of related problems. The 
simplest case is that where only b changes, that is, 


$$Ax = b + \delta b$$    (9.4-2)




If we denote the solution of (9.4-2) by $x_t + \delta x$, we find that $\delta x$ satisfies the
equation

$$A\,\delta x = \delta b$$    (9.4-3)

so that

$$\delta x = A^{-1}\,\delta b$$    (9.4-4)

Taking norms (see Sec. 1.3-3), we find that

$$\|\delta x\| \le \|A^{-1}\| \cdot \|\delta b\|$$    (9.4-5)

Since we are almost always interested in the error relative to the solution $x_t$,
we should divide by $\|x_t\|$. Now clearly

$$\|b\| \le \|A\| \cdot \|x_t\|$$    (9.4-6)

so that

$$\|x_t\| \ge \frac{\|b\|}{\|A\|}$$    (9.4-7)

Dividing inequality (9.4-5) by inequality (9.4-7) yields the result

$$\frac{\|\delta x\|}{\|x_t\|} \le \|A\| \cdot \|A^{-1}\| \, \frac{\|\delta b\|}{\|b\|}$$    (9.4-8)

The quantity $\|A\| \cdot \|A^{-1}\|$ will appear frequently in the sequel and will be
denoted K(A). These numbers, defined for the various matrix norms we have
been using, give a measure of the condition of A and are always greater than
or equal to 1 {21}. In particular

$$K_2(A) = \|A\|_2 \, \|A^{-1}\|_2$$    (9.4-9)

is called the spectral condition number of A.
A lower bound on $\|\delta x\| / \|x_t\|$ can be similarly derived by reversing the
roles of x and b and is given by {22}

$$\frac{\|\delta x\|}{\|x_t\|} \ge \frac{1}{K(A)} \, \frac{\|\delta b\|}{\|b\|}$$    (9.4-10)

From (9.4-8) we see that the larger K(A) is, the greater can be the influence
of an error in b on the accuracy of the solution of (9.1-2). Thus, if K(A) is
close to unity, we say that the matrix A is well-conditioned and if it is large,
we say that A is ill-conditioned.
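
The bounds (9.4-8) and (9.4-10) can be checked numerically on a small ill-conditioned matrix. The sketch below is illustrative only: the 2 × 2 matrix, the perturbation, the helper names, and the choice of the infinity norm are all assumptions made for this example, and the inverse is formed explicitly, which is reasonable only for so small a case.

```python
def inf_norm_vec(x):
    """Infinity norm of a vector."""
    return max(abs(t) for t in x)


def inf_norm_mat(m):
    """Infinity norm of a matrix (maximum absolute row sum)."""
    return max(sum(abs(t) for t in row) for row in m)


def inverse_2x2(m):
    """Explicit inverse of a 2 x 2 matrix."""
    (p, q), (r, s) = m
    det = p * s - q * r
    return [[s / det, -q / det], [-r / det, p / det]]


def matvec(m, x):
    return [sum(a * t for a, t in zip(row, x)) for row in m]


A = [[1.0, 1.0], [1.0, 1.0001]]      # nearly singular
A_inv = inverse_2x2(A)
K = inf_norm_mat(A) * inf_norm_mat(A_inv)   # condition number K(A)

b = [2.0, 2.0001]                    # exact solution is (1, 1)
db = [0.0, 0.0001]                   # small change in b
x_t = matvec(A_inv, b)
dx = matvec(A_inv, db)               # delta x = A^{-1} delta b, Eq. (9.4-4)

rel_dx = inf_norm_vec(dx) / inf_norm_vec(x_t)
rel_db = inf_norm_vec(db) / inf_norm_vec(b)
print("K(A)            =", K)
print("lower bound     =", rel_db / K)    # right side of (9.4-10)
print("relative change =", rel_dx)
print("upper bound     =", K * rel_db)    # right side of (9.4-8)
```

Here a relative change in b of about 5 × 10⁻⁵ produces a relative change in the solution of order unity, which is within a factor of about 2 of the upper bound.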

If we were just interested in the effects of perturbations in the data on the
results of exact computations, we could compute K(A) by computing $A^{-1}$
and using a computable norm such as the $L_1$ or $L_\infty$ norm. However, our
main purpose in introducing this perturbation analysis is to study the effect
of roundoff error on the computed solution. Generally we cannot compute
$A^{-1}$ exactly, and our computed value of K(A) may be very inaccurate.
Fortunately, there are some situations in which we can determine a lower
bound on K(A) based on the following theorem.


Theorem 9.2 Let A be a nonsingular matrix and B any singular matrix.
Then

$$\frac{1}{\|A - B\|} \le \|A^{-1}\|$$    (9.4-11)

PROOF Since B is singular, there exists a vector $x \ne 0$ such that Bx = 0.
Hence for this x

$$\|Ax\| = \|Ax - Bx\| = \|(A - B)x\| \le \|A - B\| \cdot \|x\|$$    (9.4-12)

But

$$\|x\| = \|A^{-1}Ax\| \le \|A^{-1}\| \cdot \|Ax\|$$    (9.4-13)

from which (9.4-11) follows.
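
As a small illustration of how Theorem 9.2 yields a lower bound on K(A) without computing $A^{-1}$, multiply (9.4-11) by $\|A\|$ to get $K(A) \ge \|A\| / \|A - B\|$ for any singular B. The sketch below assumes a particular 2 × 2 matrix A, a hand-picked singular neighbor B, and the infinity norm; none of these choices come from the text. The explicit inverse is used only to compare the bound with the true value.

```python
def inf_norm_mat(m):
    """Infinity norm of a matrix (maximum absolute row sum)."""
    return max(sum(abs(t) for t in row) for row in m)


A = [[1.0, 1.0], [1.0, 1.0001]]
B = [[1.0, 1.0], [1.0, 1.0]]        # singular: both rows are equal

diff = [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]
lower_bound = inf_norm_mat(A) / inf_norm_mat(diff)   # ||A|| / ||A - B||

# True condition number, for comparison only (explicit 2 x 2 inverse).
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
A_inv = [[A[1][1] / det, -A[0][1] / det],
         [-A[1][0] / det, A[0][0] / det]]
K = inf_norm_mat(A) * inf_norm_mat(A_inv)

print("lower bound on K(A):", lower_bound)
print("actual K(A):        ", K)
```

For this matrix the bound is about half the true condition number, so the nearby singular matrix already reveals that A is ill-conditioned.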


The meaning of this theorem is that, for normalized matrices A such that
$\|A\| = 1$, the condition of A depends on how close A is to a singular
matrix. Theorem 9.2 has the following interesting corollaries.


Corollary 9.2 If A is nonsingular and B is a matrix such that
$1/\|A - B\| > \|A^{-1}\|$, then B is nonsingular.


Corollary 9.3 If C is a matrix such that $\|I - C\| < 1$ for some norm,
then C is nonsingular. Similarly, if $\|D\| < 1$, then I - D is nonsingular.


PROOF In Corollary 9.2, take A = I so that $\|A^{-1}\| = 1$ and B = C. The
second part follows from the first by setting D = I - C.


From Corollary 9.3, it follows {23} that diagonally dominant matrices
are nonsingular, where a matrix A is said to be diagonally dominant if

$$|a_{ii}| > \sum_{\substack{j=1 \\ j \ne i}}^{n} |a_{ij}| \qquad i = 1, \ldots, n$$    (9.4-14)

Since all the principal minors of a diagonally dominant matrix are
themselves diagonally dominant (why?), it follows that pivoting is not necessary
in theory in the LU decomposition of such matrices {23}.
We now turn to the case where there is an error in A and see how it
affects the accuracy of the solution. To this end we look at the equation

$$(A + \delta A)(x_t + \delta x) = b$$    (9.4-15)

This situation is more complicated than that in the previous case in that
$A + \delta A$ may be singular. Therefore we seek a condition on $\delta A$ which assures
us that $A + \delta A$ is nonsingular. Since

$$A + \delta A = A(I + A^{-1}\,\delta A)$$    (9.4-16)

it follows from Corollary 9.3 that a sufficient condition for this to hold is that
$\|A^{-1}\,\delta A\| < 1$, which we shall henceforth assume. In fact, we shall assume a
stronger condition, namely that $\|A^{-1}\| \cdot \|\delta A\| < 1$. With this assumption,
we have that

$$x_t + \delta x = (I + A^{-1}\,\delta A)^{-1} A^{-1} b$$    (9.4-17)

so that

$$\delta x = [(I + A^{-1}\,\delta A)^{-1} - I]\,x_t$$    (9.4-18)

Now, if for some matrix C such that $\|C\| < 1$ we define

$$D = (I + C)^{-1} - I$$    (9.4-19)

then

$$(I + C)D = -C$$    (9.4-20)

$$\|D\| - \|C\| \cdot \|D\| \le \|D + CD\| = \|C\|$$    (9.4-21)

and

$$\|D\| \le \frac{\|C\|}{1 - \|C\|}$$    (9.4-22)


Hence, we have that

$$\|\delta x\| \le \frac{\|A^{-1}\,\delta A\| \cdot \|x_t\|}{1 - \|A^{-1}\,\delta A\|} \le \frac{\|A^{-1}\| \cdot \|\delta A\| \cdot \|x_t\|}{1 - \|A^{-1}\| \cdot \|\delta A\|}$$    (9.4-23)

From this it follows that

$$\frac{\|\delta x\|}{\|x_t\|} \le \frac{K(A)\,\|\delta A\| / \|A\|}{1 - K(A)\,\|\delta A\| / \|A\|}$$    (9.4-24)

giving an upper bound for the relative error in x in terms of the relative error
in A. Here again the condition number K(A) figures prominently.

A simpler bound on the relative error of $\delta x$ can be derived with the error
relative to the perturbed solution $x_t + \delta x$. From (9.4-15) we have that

$$A\,\delta x + \delta A(x_t + \delta x) = 0$$    (9.4-25)

so that

$$\delta x = -A^{-1}\,\delta A(x_t + \delta x)$$    (9.4-26)

and

$$\frac{\|\delta x\|}{\|x_t + \delta x\|} \le \|A^{-1}\| \cdot \|\delta A\| = K(A)\,\frac{\|\delta A\|}{\|A\|}$$    (9.4-27)
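
The two bounds (9.4-24) and (9.4-27) can be compared on a small example. The sketch below is illustrative only: the 2 × 2 matrix, the perturbation `dA`, the helper names, and the use of the infinity norm are assumptions made for the example, chosen so that $K(A)\,\|\delta A\| / \|A\| < 1$ as the derivation requires.

```python
def inf_norm_vec(x):
    return max(abs(t) for t in x)


def inf_norm_mat(m):
    return max(sum(abs(t) for t in row) for row in m)


def solve_2x2(m, b):
    """Solve a 2 x 2 system by Cramer's rule."""
    (p, q), (r, s) = m
    det = p * s - q * r
    return [(s * b[0] - q * b[1]) / det, (p * b[1] - r * b[0]) / det]


def inverse_2x2(m):
    (p, q), (r, s) = m
    det = p * s - q * r
    return [[s / det, -q / det], [-r / det, p / det]]


A = [[4.0, 1.0], [2.0, 3.0]]
dA = [[0.01, -0.02], [0.03, 0.01]]          # perturbation of A
b = [1.0, 2.0]

x_t = solve_2x2(A, b)                        # solution of A x = b
A_pert = [[a + e for a, e in zip(ra, re)] for ra, re in zip(A, dA)]
x_p = solve_2x2(A_pert, b)                   # solution of (A + dA) x = b
dx = [p - t for p, t in zip(x_p, x_t)]

K = inf_norm_mat(A) * inf_norm_mat(inverse_2x2(A))
rel_dA = inf_norm_mat(dA) / inf_norm_mat(A)

rel_err_t = inf_norm_vec(dx) / inf_norm_vec(x_t)   # left side of (9.4-24)
rel_err_p = inf_norm_vec(dx) / inf_norm_vec(x_p)   # left side of (9.4-27)
bound_24 = K * rel_dA / (1.0 - K * rel_dA)
bound_27 = K * rel_dA
print(rel_err_t, "<=", bound_24)
print(rel_err_p, "<=", bound_27)
```

With K(A) = 3 in this norm, both relative errors are a little under 0.01, against bounds of roughly 0.025 and 0.024.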


Returning to the problem of roundoff error, we see that if we can show
that the computed solution $x_c$ satisfies exactly the equation

$$(A + E)x_c = b$$    (9.4-28)

then using (9.4-15) and (9.4-24) gives

$$\frac{\|x_t - x_c\|}{\|x_t\|} \le \frac{K(A)\,\|E\| / \|A\|}{1 - K(A)\,\|E\| / \|A\|}$$    (9.4-29)

Hence, two factors determine how good the computed solution is, the size of
E and the condition of A. There is nothing we can do to change K(A).
However, as we shall see, the size of E depends on the precision of the
arithmetic used in the computation. Hence, if the bound given by (9.4-29) is
too large, we can remedy the situation by going to higher precision,
provided, of course, that A is known exactly. This higher precision will also
help in reducing the initial roundoff error which occurs when an exact A is
entered into the computer. If, on the other hand, A is uncertain to within the
matrix $\delta A$, the best we can do is to reduce E to the same order of magnitude
as $\delta A$. Further reduction of E will not contribute to any significant
improvement in the error bound.


9.4-1 Roundoff Error Analysis 


We shall now show that the computed solution $x_c$ of (9.1-2) using the
Doolittle and Crout algorithms satisfies exactly the equation


$$(A + E)x_c = b$$    (9.4-30)


where E is a matrix, depending on both A and b, with elements which are 
usually small relative to A. We assume that A has been explicitly row- 
equilibrated and that the rows of A have been permuted in such a way that 
no interchanges are required throughout the algorithms. Note that the latter 
assumption results in no loss of generality. We further assume that all the 
arithmetic operations, namely the accumulation of inner products and their 
addition to single-precision numbers and subsequent division by single- 
precision numbers, have been carried out in double precision. As we have 
noted above, this does not require much extra storage and usually not too 
much additional computation time and hence is the recommended mode of 
operation. We shall first study the Doolittle algorithm since it is easier to 
analyze, in that it is more closely connected to Gaussian elimination. We 
shall then make the necessary modifications to accommodate the Crout 
algorithm. 

Recalling (9.3-34) and (9.3-35), we have that the computed value $u_{rj}$
satisfies exactly the equation [cf. (1.5-29) and Prob. 22 of Chap. 1]

$$u_{rj} = \left( a_{rj} - \sum_{k=1}^{r-1} l_{rk} u_{kj} \right)(1 + \epsilon_{rj}) \qquad j = r, \ldots, n \qquad |\epsilon_{rj}| \le 2^{-p}$$    (9.4-31)

where the $l_{rk}$ and $u_{kj}$ are previously computed values and where we are
working with p-digit binary floating-point arithmetic. Here and in the
remainder of this section, we shall be neglecting errors of the order of $2^{-2p}$.
Therefore, the only roundoff error of order $2^{-p}$ is that committed when the
double-precision right-hand side of (9.3-34) is rounded to single precision.
Similarly the computed value of $l_{ir}$ exactly satisfies the equation

$$l_{ir} = \frac{a_{ir} - \sum_{k=1}^{r-1} l_{ik} u_{kr}}{u_{rr}}\,(1 + \epsilon_{ir}) \qquad i = r + 1, \ldots, n \qquad |\epsilon_{ir}| \le 2^{-p}$$    (9.4-32)

Hence, in view of the fact that $l_{rr} = 1$ and that $1/(1 + \epsilon_{ij}) = 1 - \epsilon_{ij} +
O(2^{-2p})$, we have from (9.4-31) and (9.4-32), respectively, that

$$a_{rj} + \epsilon_{rj} u_{rj} = \sum_{k=1}^{r} l_{rk} u_{kj} \qquad j = r, \ldots, n$$    (9.4-33)

and

$$a_{ir} + \epsilon_{ir} l_{ir} u_{rr} = \sum_{k=1}^{r} l_{ik} u_{kr} \qquad i = r + 1, \ldots, n$$    (9.4-34)

Combining these two equations for r = 2, ..., n, and recalling that $u_{1j} = a_{1j}$,
j = 1, ..., n, we have that

$$A + F = LU$$    (9.4-35)

where the elements $f_{ij}$ of F satisfy

$$f_{ij} = \begin{cases} 0 & i = 1 \\ \epsilon_{ij} u_{ij} & j \ge i \\ \epsilon_{ij} l_{ij} u_{jj} & j < i \end{cases} \qquad j = 1, \ldots, n$$    (9.4-36)

Now by the assumptions of no required interchanges and explicit
equilibration, $|l_{ij}| \le 1$, j < i, so that

$$|f_{ij}| \le 2^{-p} g$$    (9.4-37)

where $g = \max_{i,j} |u_{ij}|$ is called the growth factor. This growth factor can be
determined a posteriori and is usually of the order of unity. A precise a priori
bound for g is $2^{n-1}$, as we shall see below, but in practice it is extremely rare
for g to exceed 4. Furthermore, for various special types of matrices such as
symmetric positive definite, diagonally dominant, and band (Sec. 9.11)
matrices, much smaller theoretical bounds than $2^{n-1}$ exist for g {24, 68}. Thus,
in most practical situations, we see that under our above assumptions, the
elements of F are of the order of a single roundoff error.

The bound $2^{n-1}$ on g is determined as follows. The element $u_{ij}$ can be
identified with the element $a_{ij}^{(i-1)}$ in Gaussian elimination. We shall prove by
induction that $|a_{kl}^{(i-1)}| \le 2^{i-1}$, k, l = i, ..., n. From this it will follow that
$|u_{ij}| \le 2^{i-1} \le 2^{n-1}$. By assumption $|a_{kl}^{(0)}| \le 1$. Assume now that
$|a_{kl}^{(i-1)}| \le 2^{i-1}$. We shall show that $|a_{kl}^{(i)}| \le 2^{i}$, k, l = i + 1, ..., n. From
(9.3-7)

$$a_{kl}^{(i)} = a_{kl}^{(i-1)} - m_{ki}\, a_{il}^{(i-1)}$$    (9.4-38)

Since $|m_{ki}| \le 1$ by assumption, we have that

$$|a_{kl}^{(i)}| \le |a_{kl}^{(i-1)}| + |a_{il}^{(i-1)}| \le 2^{i}$$    (9.4-39)

as claimed.
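
That the bound $2^{n-1}$ can actually be attained is shown by a standard test matrix with ones on the diagonal and in the last column and -1 everywhere below the diagonal. The sketch below is an illustration, not part of the text's development: the function names are hypothetical, the elimination uses partial pivoting with ties resolved in favor of the current row, and g is taken as the largest element of the final upper triangle, as in the definition following (9.4-37).

```python
def growth_factor(a):
    """Largest |u_ij| after Gaussian elimination with partial pivoting."""
    n = len(a)
    u = [row[:] for row in a]
    for i in range(n - 1):
        # choose the pivot row; a strict comparison keeps row i on ties
        p = i
        for k in range(i + 1, n):
            if abs(u[k][i]) > abs(u[p][i]):
                p = k
        u[i], u[p] = u[p], u[i]
        for k in range(i + 1, n):
            m = u[k][i] / u[i][i]          # multiplier, |m| <= 1
            for l in range(i, n):
                u[k][l] -= m * u[i][l]
    return max(abs(u[i][j]) for i in range(n) for j in range(i, n))


def worst_case_matrix(n):
    """1 on the diagonal and in the last column, -1 below the diagonal."""
    return [[1.0 if (i == j or j == n - 1) else (-1.0 if i > j else 0.0)
             for j in range(n)] for i in range(n)]


for n in (3, 5, 8):
    print(n, growth_factor(worst_case_matrix(n)), 2 ** (n - 1))
```

Every multiplier has magnitude 1, so each elimination step doubles the entries of the last column, exactly the situation allowed by (9.4-39). Matrices of this kind are contrived, which is consistent with the remark that g rarely exceeds 4 in practice.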
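In the Crout decomposition, we have, by similar arguments {25} that

$$A + \tilde F = \tilde L \tilde U$$    (9.4-40)

where

$$\tilde f_{ij} = \begin{cases} 0 & j = 1 \\ \tilde\epsilon_{ij}\,\tilde l_{ij} & j \le i,\ j > 1 \\ \tilde\epsilon_{ij}\,\tilde l_{ii}\,\tilde u_{ij} & j > i \end{cases} \qquad i = 1, \ldots, n$$    (9.4-41)

Since neither $|\tilde l_{ij}|$ nor $|\tilde u_{ij}|$ is bounded by unity, it would appear that the
situation here is worse than in the Doolittle case. However, with exact
computation, we have that

$$\tilde l_{ii}\,\tilde u_{ij} = u_{ij} \qquad \tilde l_{ij} = l_{ij}\, u_{jj}$$    (9.4-42)

Hence, except for the first row and column, the elements of $\tilde f_{ij}$ are essentially
the same as those of $f_{ij}$, and we have the same bound as in (9.4-37).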

We now investigate the processes of forward and back substitution. In
the Doolittle case, we have from (9.3-28) that

$$y_i = \left( b_i - \sum_{j=1}^{i-1} l_{ij} y_j \right)(1 + \epsilon_i) \qquad i = 1, \ldots, n \qquad |\epsilon_i| \le 2^{-p}$$    (9.4-43)

from which, neglecting terms of order $2^{-2p}$ and absorbing the sign of $\epsilon_i$ into
its definition (which leaves the bound unchanged), we easily calculate

$$b_i = \sum_{j=1}^{i} l_{ij} y_j + y_i \epsilon_i \qquad i = 1, \ldots, n$$    (9.4-44)

or

$$(L + \delta L)y = b \qquad \delta L = \mathrm{diag}\,(\epsilon_i)$$    (9.4-45)

The back substitution then gives in a similar fashion {25}

$$(U + \delta U)x = y$$    (9.4-46)

where

$$\delta U = \mathrm{diag}\,(\eta_i u_{ii}) \qquad |\eta_i| \le 2^{-p}$$    (9.4-47)

Hence $x_c$ is the exact solution of

$$(L + \delta L)(U + \delta U)x = b$$    (9.4-48)

which, using (9.4-35), is equivalent to

$$(A + F + \delta L\,U + L\,\delta U + \delta L\,\delta U)x = b$$    (9.4-49)

If we set

$$E = F + \delta L\,U + L\,\delta U + \delta L\,\delta U$$    (9.4-50)

and ignore terms of order $2^{-2p}$, we see that (9.4-30) holds with

$$e_{ij} = f_{ij} + \epsilon_i u_{ij} + l_{ij} u_{jj} \eta_j$$    (9.4-51)

Note that, through (9.4-43) and a similar equation from back substitution,
the values of $e_{ij}$ depend on b but (9.4-45) and (9.4-47) show that we have a
bound on them independent of b. A similar situation holds for the modified
forward and back substitution formulas connected with the Crout
algorithm {25}.

If we now use (9.4-37) and the fact that L and U are triangular to bound
the elements $e_{ij}$, we find that

$$|e_{ij}| \le (2 + \delta_{ij} - \delta_{i1})\,2^{-p} g$$    (9.4-52)

where $\delta_{ij}$ is the Kronecker symbol; similarly for the $\tilde e_{ij}$ in the Crout method,
we have {25}

$$|\tilde e_{ij}| \le (2 + \delta_{ij} - \delta_{j1})\,2^{-p} g$$    (9.4-53)

Hence we have that

$$\|E\|_1,\ \|E\|_\infty \le (2n + 1)\,g\,2^{-p}$$    (9.4-54)

for all standard norms.
If the entire computation is carried out in single precision, we have the
markedly weaker bound [Wilkinson (1967), p. 82]

$$\|E\|_1,\ \|E\|_\infty \le (1.06)(3n^2 + n^3)\,g\,2^{-p}$$    (9.4-55)

However, it should be noted that these are extreme upper bounds and take
no account of the statistical distribution of roundoff errors.


9.5 ITERATIVE REFINEMENT


Once we have computed a solution $x_c$ to (9.1-2), we can calculate the residual
vector

$$r = b - Ax_c$$    (9.5-1)

Now, in some situations, we are not really interested in a solution which is
close to the true solution of (9.1-2), $x_t = A^{-1}b$, but only in a vector x for
which the residual r is small. In such a case, if $\|r\|$ or, better, $\|r\| / \|b\|$ is less
than some preassigned tolerance, we are through. Later in this section, we
give an algorithm for computing a solution with a smaller residual when $\|r\|$
or $\|r\| / \|b\|$ for $x_c$ is not sufficiently small. This algorithm is also applicable to
the more usual situation when we want a solution $x_c$ such that the error
vector

$$e = x_t - x_c$$    (9.5-2)

is small relative to $x_t$. In this case, the size of r may not give any indication as
to the size of e even though

$$Ae = r$$    (9.5-3)

as can be verified by premultiplying (9.5-2) by A. Proceeding as in Sec. 9.4
[cf. (9.4-3), (9.4-8), and (9.4-10)], we find that {26}

$$\frac{1}{K(A)}\,\frac{\|r\|}{\|b\|} \le \frac{\|e\|}{\|x_t\|} \le K(A)\,\frac{\|r\|}{\|b\|}$$    (9.5-4)

Thus, if the condition number of A is close to unity, then small relative errors
in r and e go together. However, for ill-conditioned matrices, small relative
errors in r can be associated with large relative errors in e and vice versa. Of
one thing we can be certain. If A is normalized such that $\|A\| = 1$ and $\|e\|$ is
small, then r is small. This follows from (9.5-3), which implies that

$$\|r\| \le \|A\| \cdot \|e\| = \|e\|$$    (9.5-5)

But notice that this does not say anything about the relative error in r. Of
this we know only that

$$\frac{\|r\|}{\|b\|} \le K(A)\,\frac{\|e\|}{\|x_t\|} = \|A^{-1}\|\,\frac{\|e\|}{\|x_t\|}$$    (9.5-6)

The perceptive reader may notice that since r can be calculated for each
computed solution $x_c$, we can solve (9.5-3) for e, which when added to $x_c$ will
yield the true solution $x_t$. This would indeed be true if we could solve (9.5-3)
exactly. However, roundoff errors enter here also, so that we are only able to
compute an approximation $e_c$ to e. Will this help us? The answer is that it
depends on the condition of A and the precision of the arithmetic. But before
entering into a discussion of this question, let us extend and formalize the
above process, called iterative refinement.

The first requirement, and a crucial one, is that the residual r be
computed using double-precision arithmetic throughout, i.e., not only in the
computation of $Ax_c$ but also in the subtraction of $Ax_c$ from b. The reason is
that r will usually be of the order of magnitude of the roundoff error, and
hence unless it is computed to higher accuracy, the relative error in the
computed r could be almost 100 percent, so that the solution of (9.5-3) with
such an r would bear no relation at all to the error e.

We now set up an iterative scheme as follows. Start with the initial
computed solution $x^{(1)}$. Now define the residual at stage m by

$$r^{(m)} = b - Ax^{(m)} \qquad m = 1, 2, \ldots$$    (9.5-7)

Define the (computed) error $e^{(m)}$ as the computed solution of

$$Ae^{(m)} = r^{(m)}$$    (9.5-8)

and the next approximate solution as

$$x^{(m+1)} = x^{(m)} + e^{(m)}$$    (9.5-9)

If $x^{(m+1)}$ is not satisfactory, we proceed to stage m + 1. We shall accept
$x^{(m+1)}$ as our solution if $\|e^{(m)}\| / \|x^{(m)}\|$ is less than a prescribed tolerance, i.e., if
the computed error at stage m does not change the significant digits of
interest in $x^{(m)}$. This stopping criterion is not foolproof and counterexamples
have been constructed {27}. However, in practice this criterion works quite
well.

We must discuss the conditions for the convergence of this process, for it
may happen that $\|e^{(m)}\|$ will never decrease. First, however, we shall mention
some practical aspects of the computation. The first is that we must save the
original A and b in order to compute the residuals. It is important to point
this out since in the algorithms we have given, we have assumed that A was
overwritten, and in practice b is also usually destroyed. The second point is
that (9.5-8) can be solved in the $O(n^2)$ operations of forward and back
substitution if we have a triangular decomposition of A. Here is the point
alluded to previously about the importance of being able to solve (9.1-2) for
a right-hand side unknown at the time of the LU decomposition of A. Each
stage in the iterative refinement process is thus seen to be quite fast relative
to the time required for the solution of the original problem. Finally, we
must remember to compute (9.5-7) to double-precision accuracy.
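
The scheme (9.5-7) to (9.5-9), together with the practical points just listed, can be sketched as follows. This is a minimal illustration under stated assumptions rather than a production routine: the function names are hypothetical, the working precision is Python's ordinary floating point, and the residual is accumulated exactly with the standard `fractions.Fraction` type as a stand-in for the double-length accumulation the text requires. The original A and b are kept, and the factorization is computed once and reused at every stage.

```python
from fractions import Fraction


def lu_factor(a):
    """LU decomposition with partial pivoting; returns (lu, perm)."""
    n = len(a)
    lu = [row[:] for row in a]           # the original A is preserved
    perm = list(range(n))
    for r in range(n):
        p = max(range(r, n), key=lambda i: abs(lu[i][r]))
        lu[r], lu[p] = lu[p], lu[r]
        perm[r], perm[p] = perm[p], perm[r]
        for i in range(r + 1, n):
            lu[i][r] /= lu[r][r]
            for j in range(r + 1, n):
                lu[i][j] -= lu[i][r] * lu[r][j]
    return lu, perm


def lu_solve(lu, perm, b):
    """Forward and back substitution, O(n^2) operations."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = b[perm[i]] - sum(lu[i][j] * y[j] for j in range(i))
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(lu[i][j] * x[j] for j in range(i + 1, n))) / lu[i][i]
    return x


def residual(a, b, x):
    """r = b - A x, accumulated exactly and rounded once at the end."""
    return [float(Fraction(bi) - sum(Fraction(aij) * Fraction(xj)
                                     for aij, xj in zip(row, x)))
            for row, bi in zip(a, b)]


def refine(a, b, tol=1e-14, max_iter=10):
    """Iterative refinement; tol and max_iter are illustrative choices."""
    lu, perm = lu_factor(a)
    x = lu_solve(lu, perm, b)            # x^(1)
    for _ in range(max_iter):
        r = residual(a, b, x)            # (9.5-7)
        e = lu_solve(lu, perm, r)        # (9.5-8)
        x_norm = max(abs(t) for t in x)
        x = [xi + ei for xi, ei in zip(x, e)]   # (9.5-9)
        if max(abs(t) for t in e) < tol * x_norm:
            break
    return x


# Example: a 6 x 6 Hilbert-type matrix, which is quite ill-conditioned.
n = 6
A = [[1.0 / (i + j + 1) for j in range(n)] for i in range(n)]
b = [sum(row) for row in A]
print(refine(A, b))
```

The stopping test is the one described above, $\|e^{(m)}\| / \|x^{(m)}\|$ below a tolerance, with a cap on the number of stages in case the process fails to converge.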

We now return to the question of when the iterative refinement process
will converge. The answer is usually if $\|e^{(1)}\|_\infty / \|x^{(1)}\|_\infty$ is less than 1/2, that is, if
$x^{(1)}$ has at least one correct binary digit. Otherwise, the matrix is too
ill-conditioned, and if a solution is desired, the LU decomposition must be
carried out with greater precision. To indicate why this is so, let us assume
that the residual $r^{(m)}$ is computed exactly and that $e^{(m)}$ is added to $x^{(m)}$
exactly. Assume now that

$$\|E^{(m)}\| \cdot \|A^{-1}\| \le 2^{-q} < \tfrac{1}{2}$$    (9.5-10)

where $e^{(m)}$ is the exact solution of

$$(A + E^{(m)})e^{(m)} = r^{(m)}$$    (9.5-11)

$E^{(m)}$ is a function of $r^{(m)}$, but, as we have seen above, we have a uniform
bound for all $E^{(m)}$. Now

$$e^{(m)} = (A + E^{(m)})^{-1}(b - Ax^{(m)}) = (A + E^{(m)})^{-1}A(x_t - x^{(m)})$$    (9.5-12)

and

$$x^{(m+1)} = x^{(m)} + (A + E^{(m)})^{-1}A(x_t - x^{(m)})$$    (9.5-13)

so that

$$x_t - x^{(m+1)} = [I - (A + E^{(m)})^{-1}A](x_t - x^{(m)})$$    (9.5-14)

Hence, as in Sec. 9.4 and using (9.5-10), we have

$$\|x_t - x^{(m+1)}\| \le \frac{\|A^{-1}\| \cdot \|E^{(m)}\| \cdot \|x_t - x^{(m)}\|}{1 - \|A^{-1}\| \cdot \|E^{(m)}\|} \le \frac{2^{-q}}{1 - 2^{-q}}\,\|x_t - x^{(m)}\|$$    (9.5-15)

Since $2^{-q} / (1 - 2^{-q}) < 1$, we have that $x^{(m)} \to x_t$, so that the process
converges. We do not know the value of $\|E^{(m)}\| \cdot \|A^{-1}\|$, but we do know, as in
(9.4-27), that

$$\frac{\|e^{(1)}\|}{\|x^{(1)}\|} \le \|E^{(1)}\| \cdot \|A^{-1}\|$$    (9.5-16)

Hence, we use the criterion that $\|e^{(1)}\| / \|x^{(1)}\| < \tfrac{1}{2}$ to start the iteration. It still
may not converge, but chances are slight. Nevertheless, we must make
provision for such a likelihood in our algorithm by stopping after a maximum
number of iterations. Since each iteration should give at least one additional
significant digit, we stop after p iterations.

As can be seen, $\|E^{(1)}\| \cdot \|A^{-1}\|$ can be quite large even though
$\|e^{(1)}\| / \|x^{(1)}\| < \tfrac{1}{2}$, and that is why counterexamples exist. However, a
program based on the above considerations and those given in the previous
sections will almost invariably give the correct solution to (9.1-2) when the
condition of the matrix warrants it and will indicate the need for higher
precision otherwise.


9.6 MATRIX ITERATIVE METHODS 


The methods we shall consider in the next two sections are analogous
to the methods of functional iteration of the last chapter. Starting with
an initial vector $x_1$, we shall generate a sequence of vectors

$$x_{i+1} = F_i(x_i, x_{i-1}, \ldots, x_{i-k})$$    (9.6-1)

where the i subscript on F denotes that the iteration function itself may
change from one iteration to the next. Analogously to Chap. 8, if the
iteration function is not dependent on i, we say that the iteration is stationary.

For most matrices A iterative methods require more computation to
achieve a desired degree of convergence than the direct methods we have
discussed. Why consider them then? We noted the basic answer to this
question in Sec. 9.2. This is that for sparse matrices, which arise commonly
in solving partial differential equations by difference techniques, iterative
techniques may indeed compare favorably with direct methods in terms of
total amount of computation. Furthermore, because they are economical in
their use of the computer memory, iterative methods are particularly
advantageous for the very large matrices† that often occur in the numerical solution
of partial differential equations.

In some situations, the matrix A need not be stored at all. Instead, each
element $a_{ij}$ is computed every time it is needed in the calculation. Usually the
computation is trivial, such as setting $a_{ij}$ to some constant, but in more
involved problems, the computation may be quite time-consuming.
However, this may still be the most efficient way to solve the problem since it
may not be possible to store even the sparse matrix A in the high-speed
memory, and so even a time-consuming computation of $a_{ij}$ may still be faster
than the retrieval of $a_{ij}$ from auxiliary storage.

We shall restrict ourselves to linear iterative processes, i.e., iterations in
which $F_i$ is a linear function of $x_i, x_{i-1}, \ldots, x_{i-k}$. The primary reason for this
is not that we are dealing with linear equations but rather that nonlinear
iterations are much more difficult to analyze in general. Moreover, they tend
to be computationally inefficient because of the number of matrix-vector
products which must be calculated. As in Chap. 8, our interest will be
focused on one-point iteration functions. A linear one-point matrix iteration
has the form

$$x_{i+1} = B_i x_i + c_i$$    (9.6-2)

where the matrix $B_i$ and vector $c_i$ are independent of i in a stationary
iteration.
To motivate considering iterations of the type (9.6-2) to solve

$$Ax = b$$    (9.6-3)

let us write (9.6-3) in the form

$$(I + A)x = x + b$$    (9.6-4)

or

$$x = (I + A)x - b$$    (9.6-5)

Equation (9.6-5) suggests the iteration

$$x_{i+1} = (I + A)x_i - b$$    (9.6-6)

Equation (9.6-2) is then just a generalization of (9.6-6).


† Computer programs to handle systems of up to order 108,000 have been written!

As in Chap. 8, we require that the true solution $x_t$ of (9.6-3) be a fixed
point of (9.6-2). Therefore, analogously to (9.6-5) we must have

$$x_t = B_i x_t + c_i$$    (9.6-7)

for all i. Since $x_t = A^{-1}b$, we have

$$A^{-1}b = B_i A^{-1}b + c_i$$    (9.6-8)

or

$$c_i = (I - B_i)A^{-1}b = C_i b$$    (9.6-9)

We assume that $B_i$ and $C_i$ are independent of b. Therefore, we must have

$$(I - B_i)A^{-1} = C_i$$    (9.6-10)

or

$$B_i + C_i A = I$$    (9.6-11)

This is called the condition of consistency for $B_i$ and $C_i$.
Because of (9.6-9) we can rewrite (9.6-2) in its more usual form

$$x_{i+1} = B_i x_i + C_i b$$    (9.6-12)

In order to consider the convergence of (9.6-12), we define

$$\epsilon_i = x_i - x_t$$    (9.6-13)

Using (9.6-7), (9.6-9), and (9.6-12), we have

$$\epsilon_{i+1} = B_i x_i + C_i b - x_t = B_i x_i - B_i x_t = B_i \epsilon_i$$    (9.6-14)

If $x_1$ is the initial approximation to the solution of (9.6-3), then

$$\epsilon_{i+1} = K_i \epsilon_1$$    (9.6-15)

where

$$K_i = B_i B_{i-1} \cdots B_1$$    (9.6-16)

Therefore, a necessary and sufficient condition for the convergence of the
sequence $\{x_i\}$ to $x_t$ for arbitrary $x_1$ is that

$$\lim_{i \to \infty} K_i y = 0 \qquad \text{for all } y$$    (9.6-17)

Another necessary and sufficient condition for convergence is that

$$\lim_{i \to \infty} \|\epsilon_i\| = 0$$    (9.6-18)

and since

$$\max_{\epsilon_1} \frac{\|\epsilon_{i+1}\|}{\|\epsilon_1\|} = \max_{\epsilon_1} \frac{\|K_i \epsilon_1\|}{\|\epsilon_1\|} = \|K_i\|$$    (9.6-19)

$$\lim_{i \to \infty} \|K_i\| = 0$$    (9.6-20)

is also such a condition. For stationary processes still another necessary and
sufficient condition for convergence is

$$\lim_{i \to \infty} \rho(K_i) = 0$$    (9.6-21)


9.7 STATIONARY ITERATIVE PROCESSES
AND RELATED MATTERS


Because they are easier to analyze and computationally more desirable, 
most matrix iterative methods are stationary. Our main concern in this 
section will be with stationary iterations. In Sec. 9.7-4, however, we shall 
consider briefly the use of nonstationary iterations to accelerate the conver- 
gence of stationary iterative processes. 

When an iteration is stationary, then $B_i = B$, $C_i = C$, and from (9.6-16)

$$K_i = B^i$$    (9.7-1)

The eigenvalues of $B^i$ are the ith powers of the eigenvalues of B. Therefore, the
condition (9.6-21) is equivalent to requiring that all eigenvalues of B lie
within the unit circle (why?).

If the iteration converges, the rate at which it does so is of interest. From
(9.6-15)

$$\|\epsilon_{i+1}\| \le \|K_i\| \cdot \|\epsilon_1\| \le \|B\|^i\, \|\epsilon_1\|$$    (9.7-2)

Therefore, the smaller some norm of B is, the more rapidly we expect the
iteration to converge. If B is symmetric, the spectral norm and spectral
radius are equal and the latter can then be used as a measure of the rate of
convergence. If, as is more usual, B is not symmetric, then although we know
that $\rho(B) \le \|B\|$, it is not obvious that the spectral radius can be used to
measure the rate of convergence. However, it can be shown, although it is
beyond our scope here [see Varga (1962), pp. 61-68], that even in the
nonsymmetric case the spectral radius does measure the rate of convergence.
Although it is usually impractical actually to calculate the spectral radius, the
effect of the spectral radius on the rate of convergence has important
ramifications; see, for example, Sec. 9.7-4.


9.7-1 The Jacobi Iteration

We write the matrix A in the form

$$A = D + L + U$$    (9.7-3)

where D is a diagonal matrix and L and U are, respectively, lower and upper
triangular matrices with zeros on the diagonal. Then (9.6-3) can be written

$$Dx = -(L + U)x + b$$    (9.7-4)

which suggests the iteration

$$x_{i+1} = -D^{-1}(L + U)x_i + D^{-1}b$$    (9.7-5)

We assume here naturally that the diagonal of A contains no zero
elements. If it does have zero terms but A is nonsingular, then by permuting
rows and columns it is always possible to get a nonsingular matrix D. It is in
fact desirable to have the diagonal elements as large as possible in relation to
the off-diagonal elements. For suppose we set $x_1 = 0$. Then $x_2 = D^{-1}b$, and
if the diagonal terms are dominant, this is already a good approximation to
the solution.

The iteration (9.7-5) is known by many names, but the most usual are
the Jacobi iteration or the method of simultaneous displacements. The latter
name follows from the fact that every element of the solution vector is
changed before any of the new elements are used in the iteration (cf.
Sec. 9.7-2). For this method the matrix B is

$$B = -D^{-1}(L + U)$$    (9.7-6)

It may be quite difficult to determine whether B is such that the iteration
converges. Of course, if $\|D^{-1}(L + U)\| < 1$ for some easily computable
matrix norm, then we are certain that the iteration does converge.
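
The iteration (9.7-5) is short to write down componentwise. The following sketch is illustrative only: the function name `jacobi`, the fixed iteration count, and the 3 × 3 diagonally dominant test system (whose exact solution is (1, 1, 1)) are assumptions made for the example.

```python
def jacobi(a, b, x0, iterations):
    """Jacobi iteration: every component is updated from the old vector."""
    n = len(b)
    x = x0[:]
    for _ in range(iterations):
        # the whole new vector is built before it replaces the old one
        x = [(b[r] - sum(a[r][j] * x[j] for j in range(n) if j != r)) / a[r][r]
             for r in range(n)]
    return x


A = [[4.0, -1.0, 0.0],
     [-1.0, 4.0, -1.0],
     [0.0, -1.0, 4.0]]
b = [3.0, 2.0, 3.0]

print(jacobi(A, b, [0.0, 0.0, 0.0], 1))    # one step from zero gives D^{-1} b
print(jacobi(A, b, [0.0, 0.0, 0.0], 60))   # close to the solution (1, 1, 1)
```

For this matrix the infinity norm of $D^{-1}(L + U)$ is 1/2, so the sufficient condition just stated guarantees convergence.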

The Jacobi method as we have presented it here is seldom used in 
practice for the solution of (9.6-3). This is largely because the Gauss-Seidel 
method to be considered below almost always converges when the Jacobi 
method does, may converge when the Jacobi method does not, and generally 
converges faster than the Jacobi method. Furthermore, the implementation 
of the Gauss-Seidel method on a computer is more efficient than that of the 
Jacobi method. 


9.7-2 The Gauss-Seidel Method 


The difference between the Jacobi and Gauss-Seidel methods is that in the 
latter, as each component of x;,, is computed, we use it immediately in the 
iteration. For this reason the Gauss-Seidel method is sometimes called 
the method of successive displacements. 

Denote the jth component of x; by x¥. Consider the system of equations 
written in the form (9.1-1). To compute x{'), we use the first equation in the 
form 


1/< ; 

f= — (Sax? 64] (97-7) 
Qi1 \j=2 

(We assume here as before that D is nonsingular.) To get x{?), we use the 

second equation, but with x‘! replaced by the result of (9.7-7) 


l " 
2) _ 1 (i) 
xi), = ——|[a,x!), + Y ayjxl — b, (9.7-8) 
a22 j=3 
In general 
l r-1 n 
xf, = ~( y a, ;x~h + Y a, xt _ , 

rr \j=1 jart+il 
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With A written in the form (9.7-3), this method becomes {31} 


    x_{i+1} = -D^{-1}(L x_{i+1} + U x_i) + D^{-1} b    (9.7-10)

which can be solved for $x_{i+1}$ to give

    x_{i+1} = -(D + L)^{-1} U x_i + (D + L)^{-1} b    (9.7-11)

Since $B = -(D + L)^{-1}U$, a sufficient condition for convergence is then that
$\|(D + L)^{-1}U\| < 1$, but this is not very useful in practice (why?). In the
important case, however, in which A is positive definite, we can prove that
the Gauss-Seidel method converges.
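
The componentwise form (9.7-9) translates almost directly into code. The sketch
below is an illustration only: the names `gauss_seidel_step` and `gauss_seidel`,
the stopping tolerance, and the sample positive definite system are assumptions,
not part of the text.

```python
def gauss_seidel_step(A, b, x):
    """One sweep of (9.7-9): each new component is used as soon as it is available."""
    n = len(A)
    x = list(x)                      # copy; entries are overwritten in place
    for r in range(n):
        # x[j] already holds the new value for j < r and the old value for j > r
        s = sum(A[r][j] * x[j] for j in range(n) if j != r)
        x[r] = (b[r] - s) / A[r][r]
    return x

def gauss_seidel(A, b, x0, tol=1e-12, max_sweeps=500):
    """Iterate until successive iterates differ by less than tol in every component."""
    x = list(x0)
    for sweep in range(1, max_sweeps + 1):
        x_new = gauss_seidel_step(A, b, x)
        if max(abs(u - v) for u, v in zip(x_new, x)) < tol:
            return x_new, sweep
        x = x_new
    raise RuntimeError("no convergence within max_sweeps")

# Assumed sample system: symmetric and positive definite, exact solution (1, 2, 1).
A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
b = [2.0, 6.0, 2.0]
solution, sweeps = gauss_seidel(A, b, [0.0, 0.0, 0.0])
print(solution, sweeps)
```

Only one vector is stored, since the new components overwrite the old ones
during the sweep. Note that the stopping test on successive differences is a
practical convention, not a guarantee on the true error.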


Theorem 9.3 If the matrix A is positive definite, the Gauss-Seidel
iteration (9.7-11) converges independently of the initial vector.


PROOF We write A as

    A = L + D + L^T    (9.7-12)

since A is symmetric. The matrix B is then

    B = -(D + L)^{-1} L^T    (9.7-13)

Let $-\lambda$ and v be, respectively, an eigenvalue and eigenvector of B. Then

    (D + L)^{-1} L^T v = \lambda v    (9.7-14)

or

    L^T v = \lambda (D + L) v    (9.7-15)

Even though A is positive definite, the eigenvalues of B may still be
complex. We have†

    v^* L^T v = \lambda v^* (D + L) v    (9.7-16)

Adding $v^*(D + L)v$ to both sides of (9.7-16), we get

    v^* A v = (1 + \lambda) v^* (D + L) v    (9.7-17)

Since A is real and symmetric, the conjugate transpose of the left-hand
side of (9.7-17) leaves this quantity unchanged. Therefore,

    (1 + \bar{\lambda}) v^* (D + L)^T v = (1 + \lambda) v^* (D + L) v
        = (1 + \lambda)(v^* D v + v^* L v)
        = (1 + \lambda)[v^* D v + \bar{\lambda} v^* (D + L)^T v]    (9.7-18)

the last line following from use of the conjugate transpose of (9.7-15).
Rearranging the terms, we have

    (1 - |\lambda|^2) v^* (D + L)^T v = (1 + \lambda) v^* D v    (9.7-19)

Multiplying both sides of (9.7-19) by $1 + \bar{\lambda}$ and then using the conjugate
transpose of (9.7-17), we get

    (1 - |\lambda|^2) v^* A v = |1 + \lambda|^2 v^* D v    (9.7-20)

Since A is positive definite so is D; moreover, no eigenvalue $-\lambda$ of B can
equal 1 (why?). Therefore, we must have $1 - |\lambda|^2 > 0$, which means
that the eigenvalues of B lie within the unit circle. This completes the
proof. A partial converse of this theorem will be found in a
problem {32}.

† $v^*$ denotes the conjugate transpose of v.


We have noted that much of the application of iterative methods such as 
the Jacobi and Gauss-Seidel methods is in the numerical solution of partial 
differential equations. The coefficient matrices that arise in the numerical 
solution of partial differential equations are often such that the iteration 
matrix B in (9.7-1) is nonnegative; i.e., all the elements of B are nonnegative. 
The Perron-Frobenius theory of nonnegative matrices, a subject which is
beyond the scope of this book [see Varga (1962), pp. 26-33], provides the
basis for analysis of iterative methods when B is nonnegative. We content
ourselves here with stating the Stein-Rosenberg theorem, which follows from
the Perron-Frobenius theory. Suppose that by dividing each equation by its
diagonal element we make D in (9.7-3) the identity matrix. Suppose further
that $B_J = -(L + U)$ in (9.7-6) is nonnegative. Then it follows {33} that
$B_G = -(I + L)^{-1}U$ in (9.7-11) is also nonnegative. Let $\rho_J$ and $\rho_G$ be the
spectral radii, respectively, of $B_J$ and $B_G$. Then this theorem states that one
of the following conditions holds:


    Condition 1:  \rho_J = \rho_G = 0
    Condition 2:  \rho_J = \rho_G = 1
    Condition 3:  0 < \rho_G < \rho_J < 1
    Condition 4:  1 < \rho_J < \rho_G


Thus the Jacobi and Gauss-Seidel iterations both converge or both diverge,
and when they both converge, the Gauss-Seidel method converges faster
(except for the trivial case, condition 1). This theorem is the basis for our
previous statement that the Jacobi iteration is seldom used in preference to
the Gauss-Seidel iteration.
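
The conclusion of the theorem can be checked numerically on a small case. The
following sketch is an illustration under stated assumptions: the helper names
are hypothetical, the test matrix is an assumed tridiagonal example whose Jacobi
matrix is nonnegative, and the spectral radius is only estimated, using the
limit of $\|B^k\|^{1/k}$ as k grows (a standard fact that the text does not use).
Each iteration matrix is built column by column by applying one sweep, with
b = 0, to a unit vector.

```python
def jacobi_step(A, b, x):
    n = len(A)
    return [(b[r] - sum(A[r][j] * x[j] for j in range(n) if j != r)) / A[r][r]
            for r in range(n)]

def gauss_seidel_step(A, b, x):
    n, x = len(A), list(x)
    for r in range(n):
        x[r] = (b[r] - sum(A[r][j] * x[j] for j in range(n) if j != r)) / A[r][r]
    return x

def iteration_matrix(step, A):
    """Column j of B is the result of one sweep applied to e_j with b = 0."""
    n = len(A)
    zero = [0.0] * n
    cols = [step(A, zero, [1.0 if i == j else 0.0 for i in range(n)])
            for j in range(n)]
    return [[cols[j][i] for j in range(n)] for i in range(n)]

def spectral_radius_estimate(B, k=200):
    """Crude estimate of rho(B) as the k-th root of the infinity norm of B^k."""
    n = len(B)
    P = [row[:] for row in B]
    for _ in range(k - 1):
        P = [[sum(P[i][l] * B[l][j] for l in range(n)) for j in range(n)]
             for i in range(n)]
    return max(sum(abs(v) for v in row) for row in P) ** (1.0 / k)

A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]   # assumed example
rho_J = spectral_radius_estimate(iteration_matrix(jacobi_step, A))
rho_G = spectral_radius_estimate(iteration_matrix(gauss_seidel_step, A))
print(rho_J, rho_G)   # roughly 0.354 and 0.126, i.e. condition 3
```

For this matrix the exact values are $\rho_J = \sqrt{2}/4$ and $\rho_G = 1/8$, so the
Gauss-Seidel error shrinks per sweep about as much as the Jacobi error does in
two sweeps.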

Equation (9.6-14) implies that matrix iterative methods converge
linearly. They are therefore candidates for the $\delta^2$ process (see Sec. 8.7-1).
Analogous to Eq. (8.7-4) we can write

    x^{(r)} \approx x_{i+2}^{(r)} - \frac{(\Delta x_{i+1}^{(r)})^2}{\Delta^2 x_i^{(r)}}    r = 1, ..., n    (9.7-21)

where the superscripts denote components of the relevant vectors.

Example 9.4 For the system

    4x^{(1)} - x^{(2)} = 2
    -x^{(1)} + 4x^{(2)} - x^{(3)} = 6
    -x^{(2)} + 4x^{(3)} = 2

do four iterations of the Jacobi and Gauss-Seidel methods and then use (9.7-21) to get an
improved solution. (Although of too low an order to arise in a practical problem the form
of the coefficient matrix is typical of those which arise in the numerical solution of partial
differential equations.) Use $x_1 = 0$ for both methods.

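For the Jacobi iteration

    B = -D^{-1}(L + U)
      = -\begin{bmatrix} 1/4 & 0 & 0 \\ 0 & 1/4 & 0 \\ 0 & 0 & 1/4 \end{bmatrix}
         \begin{bmatrix} 0 & -1 & 0 \\ -1 & 0 & -1 \\ 0 & -1 & 0 \end{bmatrix}
      = \begin{bmatrix} 0 & 1/4 & 0 \\ 1/4 & 0 & 1/4 \\ 0 & 1/4 & 0 \end{bmatrix}

From (9.7-5) we calculate

    x_2^T = (1/2, 3/2, 1/2)
    x_3^T = (7/8, 7/4, 7/8)
    x_4^T = (15/16, 31/16, 15/16)
    x_5^T = (63/64, 63/32, 63/64)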

For the Gauss-Seidel iteration, using (9.7-9) we calculate

    x_2^T = (1/2, 13/8, 29/32)
    x_3^T = (29/32, 125/64, 253/256)
    x_4^T = (253/256, 1021/512, 2045/2048)
    x_5^T = (2045/2048, 8189/4096, 16,381/16,384)

and, since the true solution is $x^T = (1, 2, 1)$, it is clear that both iterations are converging
and that the Gauss-Seidel is converging more rapidly. Calculation of the spectral radii of
the B matrices for the two iterations would indicate just how fast each is converging {34}.
If we apply (9.7-21) with i = 2 to the Gauss-Seidel results, we have

    \Delta x_3^{(1)} = 21/256        \Delta x_3^{(2)} = 21/512        \Delta x_3^{(3)} = 21/2048
    \Delta^2 x_2^{(1)} = -83/256     \Delta^2 x_2^{(2)} = -147/512    \Delta^2 x_2^{(3)} = -147/2048

This gives as the new approximation

    (335/332, 2, 1)

which is an improved result in all components and a perfect result in two of them.
However, it is important that the $\delta^2$ process not be used too early in the computation, in
which case it may give poorer results than the last iteration {34}.
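
The arithmetic of this example can be reproduced exactly with rational numbers.
The sketch below is illustrative only; the function names are hypothetical, and
the middle equation of the system is taken as $-x^{(1)} + 4x^{(2)} - x^{(3)} = 6$, which is
what the Jacobi matrix and the true solution (1, 2, 1) imply. The list `iterates`
is indexed so that `iterates[i]` holds $x_i$, with $x_1 = 0$.

```python
from fractions import Fraction

def gauss_seidel_step(A, b, x):
    """One Gauss-Seidel sweep in exact rational arithmetic."""
    n, x = len(A), list(x)
    for r in range(n):
        s = sum(A[r][j] * x[j] for j in range(n) if j != r)
        x[r] = (b[r] - s) / A[r][r]
    return x

def delta_squared(x_a, x_b, x_c):
    """Componentwise extrapolation from three successive iterates, as in (9.7-21)."""
    result = []
    for a, m, c in zip(x_a, x_b, x_c):
        d1 = c - m              # forward difference of the middle iterate
        d2 = c - 2 * m + a      # second difference of the first iterate
        result.append(c - d1 * d1 / d2 if d2 != 0 else c)
    return result

A = [[4, -1, 0], [-1, 4, -1], [0, -1, 4]]
b = [2, 6, 2]

iterates = [None, [Fraction(0)] * 3]          # iterates[1] is x_1 = 0
for _ in range(4):
    iterates.append(gauss_seidel_step(A, b, iterates[-1]))

# i = 2 in (9.7-21): extrapolate from x_2, x_3, x_4
improved = delta_squared(iterates[2], iterates[3], iterates[4])
print(iterates[5])
print(improved)
```

Running the same extrapolation on `iterates[1]`, `iterates[2]`, `iterates[3]` is a
quick way to experiment with the warning about applying the process too early.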


9.7-3 Roundoff Error in Iterative Methods 


The roundoff error incurred in an iterative method is only that which is 
incurred in the last iteration. This is because we always use the original 
coefficient matrix and, so to speak, start from scratch at each iteration using
the last iterate as the initial vector. It is perhaps natural to expect then that
roundoff error is a much less severe problem in iterative methods than it is in 
direct methods. However, not only can roundoff be a serious source of error 
in iterative methods, but also it may be very nearly as serious as in direct 
methods. In this section we shall consider roundoff error in the Gauss-Seidel 
iteration. In analogy with the error analysis of Sec. 9.4, we ask: What are the 
perturbations of the matrix A and vector b such that, starting from $x_i$ as
computed, the result of using these perturbations in (9.7-11) would have
been the computed value $x_{i+1}$ if no roundoff error were incurred?

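Before studying the effect of roundoff errors, let us rewrite (9.7-9) to take
into account only nonzero elements $a_{rj}$. We then have

    x_{i+1}^{(r)} = -\frac{1}{a_{rr}} \left( \sum_{j=k_{1r}}^{k_{pr}} a_{rj} x_{i+1}^{(j)} + \sum_{j=k_{p+1,r}}^{k_{qr}} a_{rj} x_i^{(j)} - b_r \right)    r = 1, ..., n    (9.7-22)

where

    1 \le k_{1r} < \cdots < k_{pr} \le r - 1 < r + 1 \le k_{p+1,r} < \cdots < k_{qr} \le n    (9.7-23)

and $a_{rj} \ne 0$ if and only if $j = k_{lr}$, l = 1, ..., q.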
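If we now calculate $x_{i+1}^{(r)}$ using p-digit binary floating-point operations,
we find that the $x_{i+1}^{(r)}$ satisfy exactly the equation

    x_{i+1}^{(r)} = -\frac{1}{a_{rr}} \left\{ \left[ \sum_{j=k_{1r}}^{k_{pr}} a_{rj} x_{i+1}^{(j)} (1 + \epsilon_j) \prod_{s=m(j)}^{p} (1 + \eta_s)
        + \sum_{j=k_{p+1,r}}^{k_{qr}} a_{rj} x_i^{(j)} (1 + \epsilon_j) \prod_{s=m'(j)}^{q-p} (1 + \eta_s) \right] (1 + \delta_1) - b_r \right\} (1 + \delta_2)(1 + \delta_3)
                                                                        (9.7-24)

where

    m(j) = 2    j = k_{1r}               m'(j) = 2        j = k_{p+1,r}
    m(j) = l    j = k_{lr}; l > 1        m'(j) = l - p    j = k_{lr}; l > p + 1

and

    \epsilon_j = relative error in the multiplication of $a_{rj}$ by $x_{i+1}^{(j)}$ or $x_i^{(j)}$
    \prod (1 + \eta_s) = relative error in the accumulation of the sum with $\eta_s$
          corresponding to $\epsilon_s$ in (1.5-26)
    \delta_1 = relative error in the addition of the two sums
    \delta_2 = relative error in the addition of $b_r$
    \delta_3 = relative error in the division by $a_{rr}$

Multiplying (9.7-24) by $a_{rr}/[(1 + \delta_2)(1 + \delta_3)]$ and expressing the result in
vector-matrix notation, we have

    \bar{D} x_{i+1} = -\bar{L} x_{i+1} - \bar{U} x_i + b    (9.7-25)

where

    \bar{d}_{rr} = \frac{a_{rr}}{(1 + \delta_2)(1 + \delta_3)}

    \bar{l}_{rj} = \begin{cases} a_{rj}(1 + \epsilon_j) \prod (1 + \eta_s)(1 + \delta_1) & r > j; \; j = k_{lr} \\ 0 & \text{otherwise} \end{cases}

    \bar{u}_{rj} = \begin{cases} a_{rj}(1 + \epsilon_j) \prod (1 + \eta_s)(1 + \delta_1) & r < j; \; j = k_{lr} \\ 0 & \text{otherwise} \end{cases}    (9.7-26)

and $|\epsilon_j|, |\eta_s|, |\delta_i| \le 2^{-p}$.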

If, however, the computation of (9.7-9) is carried out in double-precision
arithmetic operating on the single-precision numbers $a_{rj}$, $x_i^{(j)}$, $x_{i+1}^{(j)}$,
$b_r$ with only a final roundoff to obtain a single-precision value for $x_{i+1}^{(r)}$, then
(9.7-9) takes the form

    x_{i+1}^{(r)} = -\frac{1}{a_{rr}} \left( \sum_{j=1}^{r-1} a_{rj} x_{i+1}^{(j)} + \sum_{j=r+1}^{n} a_{rj} x_i^{(j)} - b_r \right) (1 + \delta_3)    (9.7-27)

in which case we have

    \bar{D} x_{i+1} = -L x_{i+1} - U x_i + b    (9.7-28)

with $\bar{d}_{rr} = a_{rr}/(1 + \delta_3)$.

Thus, when the computations are carried out in double precision, the
computed vector $x_{i+1}$ is the exact result of operating on $x_i$ with a matrix $\bar{A}$
which differs from A by at most one unit on the diagonal. In the
single-precision case, $\bar{A}$ is not that close to A. However, for sparse matrices, $\bar{A}$ will
differ from A by at most of the order of Q units in the last place, where Q is
the maximum number of nonvanishing elements in a row of A. Note that the
b which appears in all the equations above is the true right-hand side.

In general, this amount of roundoff error will not be serious, especially 
since in an iterative process the truncation error incurred by terminating the 
iteration will usually be of a greater order of magnitude. However, it can be a 
serious problem when the system is ill-conditioned. In {35} a simple example 
illustrating this point is considered. 


9.7-4 Acceleration of Stationary Iterative Processes 


Since the rate of convergence of a stationary iterative process depends on 
the spectral radius of B, any modification of the matrix B that will reduce the 
spectral radius will increase the rate of convergence. We consider briefly here 
a method of accelerating the convergence of an iterative process called the 
method of successive overrelaxation. First we rearrange the system (9.6-3) so 
that each diagonal element of A is 1. This can be done by arranging the 
system so that no diagonal term is zero and then dividing each equation by 
its diagonal element. For positive definite matrices, we preserve symmetry
by pre- and postmultiplying A by $D^{-1/2}$. When we have done this,
Eq. (9.7-9) for the Gauss-Seidel iteration becomes

    x_{i+1}^{(r)} = -\sum_{j=1}^{r-1} a_{rj} x_{i+1}^{(j)} - \sum_{j=r+1}^{n} a_{rj} x_i^{(j)} + b_r
                  = x_i^{(r)} - \sum_{j=1}^{r-1} a_{rj} x_{i+1}^{(j)} - \sum_{j=r}^{n} a_{rj} x_i^{(j)} + b_r    (9.7-29)

since $a_{rr} = 1$. Consider replacing (9.7-29) by

    x_{i+1}^{(r)} = x_i^{(r)} - \omega \left( \sum_{j=1}^{r-1} a_{rj} x_{i+1}^{(j)} + \sum_{j=r}^{n} a_{rj} x_i^{(j)} - b_r \right)    (9.7-30)

where $\omega$ is the overrelaxation factor. Using (9.7-30), we can write this
iteration in matrix form as {36}

    x_{i+1} = B_\omega x_i + C_\omega b    (9.7-31)

where

    B_\omega = (I + \omega L)^{-1} [(1 - \omega)I - \omega U]        C_\omega = \omega (I + \omega L)^{-1}    (9.7-32)

with L and U as in (9.7-3). The crux of the matter now, of course, is to choose
$\omega$ so as to minimize $\rho(B_\omega)$. The reader will probably not be surprised to find
that this is a problem of some difficulty. It is in fact the subject of a large
literature. We content ourselves here with stating that for a large class of
matrices that arise in the numerical solution of partial differential equations,
there is a simple relationship between the optimum value of $\omega$ and $\rho(L + U)$.
This does not get rid of the problem, but finding the eigenvalues or at least
estimates of the eigenvalues of L + U may be reasonably simple. Moreover,
the insight gained from the relationship between $\omega$ and $\rho(L + U)$ is useful,
for example, in indicating that it is preferable to overestimate $\omega$ rather than
to underestimate it.
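
A sweep of (9.7-30) differs from a Gauss-Seidel sweep only in the factor $\omega$.
The following sketch is an illustration under assumptions: the function names,
the stopping rule, the sample system, and the trial values of $\omega$ are all
hypothetical choices, and no attempt is made to compute an optimum $\omega$. Each
equation is first divided by its diagonal element, as described above.

```python
def scale_to_unit_diagonal(A, b):
    """Divide each equation by its diagonal element so that a_rr = 1."""
    n = len(A)
    A_scaled = [[a / A[r][r] for a in A[r]] for r in range(n)]
    b_scaled = [b[r] / A[r][r] for r in range(n)]
    return A_scaled, b_scaled

def sor_step(A, b, x, omega):
    """One sweep of (9.7-30); A must already have unit diagonal."""
    n, x = len(A), list(x)
    for r in range(n):
        # x[j] holds the new value for j < r and the old value for j >= r
        residual = sum(A[r][j] * x[j] for j in range(n)) - b[r]
        x[r] -= omega * residual
    return x

def sor(A, b, x0, omega, tol=1e-12, max_sweeps=1000):
    A, b = scale_to_unit_diagonal(A, b)
    x = list(x0)
    for sweep in range(1, max_sweeps + 1):
        x_new = sor_step(A, b, x, omega)
        if max(abs(u - v) for u, v in zip(x_new, x)) < tol:
            return x_new, sweep
        x = x_new
    raise RuntimeError("no convergence within max_sweeps")

A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]   # assumed example
b = [2.0, 6.0, 2.0]
for omega in (1.0, 1.1):          # omega = 1 reproduces Gauss-Seidel
    solution, sweeps = sor(A, b, [0.0, 0.0, 0.0], omega)
    print(omega, sweeps, solution)
```

Varying `omega` over a grid and recording `sweeps` gives a simple empirical view
of how sensitive the rate of convergence is to the relaxation factor.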

Successive overrelaxation is only one of a large class of methods that 
have been developed to accelerate the convergence of iterative processes. 
Some of these accelerations lead to nonstationary iterations; for example, 


    x_{i+1} = [(1 + \omega_i)B - \omega_i I] x_i + (1 + \omega_i) C b    (9.7-33)


where the relaxation factor $\omega_i$ changes from step to step. The proper use of
such acceleration procedures lies at the heart of much of the utilization of 
high-speed computing equipment for the solution of partial differential 
equations. 


9.8 MATRIX INVERSION 


The solution of simultaneous linear equations can certainly be accomplished 
by inverting the matrix A. In fact, if the system (9.1-1) is to be solved for 
many different right-hand sides, one may first invert A and then calculate
$A^{-1}b$ for each right-hand side. However, it is more efficient to compute the
LU decomposition of A and then solve for each right-hand side using
forward and back substitution. The only time it is necessary to compute the
inverse of a matrix is when the elements of $A^{-1}$ are explicitly needed, as in
certain statistical calculations or in the matrix-updating procedures for
solving systems of nonlinear equations (Sec. 8.8). One very efficient way to invert
A in general is a simple extension of one of the algorithms of Sec. 9.3-3.

We suppose that the matrix A which we wish to invert has been
decomposed into a product LU by the techniques of Sec. 9.3-3, where L has
the form (9.3-22) and U the form

    U = \begin{bmatrix} u_{11} & u_{12} & \cdots & u_{1n} \\ & u_{22} & \cdots & u_{2n} \\ & O & \ddots & \vdots \\ & & & u_{nn} \end{bmatrix}    (9.8-1)

Then since

    A^{-1} = U^{-1} L^{-1}    (9.8-2)

our problem is to invert the triangular matrices L and U. We leave to a
problem {40} the derivation of the result that the inverse of a lower (upper)
triangular matrix is lower (upper) triangular.

Let $L^{-1} = \{r_{ij}\}$ with $r_{ij} = 0$, $i < j$, and let $L_i$ and $L_j^{-1}$ be, respectively, the
ith column of L and the jth row of $L^{-1}$. Then for $1 \le k \le n$

    L_k^{-1} L_k = 1 = r_{kk}
    L_k^{-1} L_j = 0 = \sum_{i=j}^{k} r_{ki} l_{ij}    j = k - 1, ..., 1    (9.8-3)

from which we can calculate $r_{ki}$, i = k, k - 1, ..., 1, for any k, thereby
obtaining all the elements of $L^{-1}$.

Similarly we can calculate $U^{-1}$, the only difference being the fact that
the diagonal elements of U are not 1s {41}. The calculation of $L^{-1}$ and $U^{-1}$
followed by the matrix multiplication $U^{-1}L^{-1}$ requires $\frac{2}{3}n^3 + O(n^2)$
multiplications and divisions {41}, so that $n^3 + O(n^2)$ are required for the
complete inversion. Some variants of the procedure of this section and other
methods of matrix inversion are considered in the problems {42 to 47}.
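
The recurrence (9.8-3) for a lower triangular matrix with unit diagonal can be
sketched as follows. This is an illustration only; the function name
`invert_unit_lower` and the sample matrix are assumptions. Row k of the inverse
is built from its diagonal element backward, using $l_{jj} = 1$ to solve each
relation for $r_{kj}$.

```python
from fractions import Fraction

def invert_unit_lower(L):
    """Invert a unit lower triangular matrix row by row via (9.8-3)."""
    n = len(L)
    R = [[Fraction(0)] * n for _ in range(n)]
    for k in range(n):
        R[k][k] = Fraction(1)                      # r_kk = 1 since l_kk = 1
        for j in range(k - 1, -1, -1):
            # 0 = sum_{i=j}^{k} r_ki l_ij  with l_jj = 1, solved for r_kj
            R[k][j] = -sum(R[k][i] * L[i][j] for i in range(j + 1, k + 1))
    return R

L = [[1, 0, 0],
     [2, 1, 0],
     [3, 4, 1]]            # assumed sample matrix
R = invert_unit_lower(L)
print(R)
```

The same loop handles an upper triangular U after transposition, provided each
relation is divided by the diagonal element of U, which is no longer 1.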


9.9 OVERDETERMINED SYSTEMS OF 
LINEAR EQUATIONS 


In this section we consider an alternative approach to the least-squares 
problem considered in Chap. 6, an approach which has application not only 
to the case of fitting a function to experimental data but also to such
problems as, for example, the approximation of a function given on a finite set of
points. The problem stated in Eq. (6.2-1), namely approximating observed
values $f_i$ by a linear combination of a set of functions, can be reformulated as
an attempt to solve the system of linear equations 


Ay =b (9.9-1) 


where A is a matrix of n rows and m columns. [In the notation of Chap. 6,
$b_i = f_i$ and $a_{ij} = \phi_j(x_i)$, and we have replaced m by m - 1.] In the usual case,
n > m, so that the system (9.9-1) is overdetermined and cannot be solved.
Instead we attempt to minimize the norm $\|r\|$ of the residual vector


    r = b - Ay    (9.9-2)


where the norms of interest are the $L_1$, $L_2$, and $L_\infty$ norms. We shall first
discuss the case of the $L_2$ norm and return to the other two later.

When we try to minimize the $L_2$, or least-squares, norm, we obtain the
normal equations of Chap. 6, which, in our formulation, have the form [cf.
Eq. (6.2-6)]


    A^T A y = A^T b    (9.9-3)


It can be shown that the matrix $A^T A$ is positive definite and hence
nonsingular if the columns of A are linearly independent {51}. Nevertheless,
as discussed in Sec. 6.3-1, the matrix $A^T A$ may be ill-conditioned and for
moderate values of m, the solution y will not be accurate. Hence, an
alternative method must be used. One such method for the case where the
coefficients $a_{ij} = x_i^{j-1}$ was given in Sec. 6.4, namely the use of orthogonal
polynomials. Here we present another method applicable to the general case
(9.9-1). Before we can give the details of this method, however, we need some
preliminary definitions and results on special kinds of matrices.


Definition 9.1 A matrix Q is said to be orthogonal if $Q^T Q = I$.


Since this says that $Q^T = Q^{-1}$, it follows that $QQ^T = I$. Orthogonal
matrices have the property that for any vector x the Euclidean length of Qx,
$\|Qx\|_2 = \|x\|_2$. This follows immediately from the definition since


    \|Qx\|_2^2 = (Qx)^T Qx = x^T Q^T Q x = x^T x = \|x\|_2^2    (9.9-4)


Definition 9.2 An elementary reflector P is a matrix of the form


    P = I - 2vv^T    where $v^T v = 1$    (9.9-5)


These matrices are also known as elementary Hermitian matrices or
Householder transformations (cf. Sec. 10.3-3). They reflect the space in the
hyperplane through the origin orthogonal to v. The matrices P are symmetric and
orthogonal {52} and have the important property that, given any two
vectors of equal length, x and y, we can find a matrix P such that Px = y. To
this end we take $v = (x - y)/\|x - y\|_2$. We prove this as follows:

    Px = \left[ I - \frac{2(x - y)(x - y)^T}{\|x - y\|_2^2} \right] x
       = x - \frac{2(x^T x - y^T x)}{x^T x - y^T x - x^T y + y^T y} (x - y)    (9.9-6)

Since $x^T x = y^T y$ by hypothesis and $y^T x = x^T y$ since this is a scalar quantity
which is invariant under transposition, it follows that the denominator in
(9.9-6) reduces to $2(x^T x - y^T x)$ and we have that Px = y.

With the aid of this result, we can find a sequence of Householder
transformations $P_1, P_2, ..., P_m$ which when applied to A yields a matrix of
the form $(R, 0)^T$, where R is an upper triangular matrix. Let us assume for
the moment that we have found a matrix $P = P_m P_{m-1} \cdots P_1$ such that

    PA = \begin{bmatrix} R \\ 0 \end{bmatrix}        Pb = \begin{bmatrix} c \\ d \end{bmatrix}    (9.9-7)

Then

    Pr = Pb - PAy = \begin{bmatrix} c \\ d \end{bmatrix} - \begin{bmatrix} R \\ 0 \end{bmatrix} y = \begin{bmatrix} c - Ry \\ d \end{bmatrix}    (9.9-8)

Since P is orthogonal,

    \|r\|_2^2 = \|Pr\|_2^2 = \|c - Ry\|_2^2 + \|d\|_2^2    (9.9-9)

and $\|r\|_2^2$ is minimized if c - Ry = 0. Thus, our least-squares solution is

    y = R^{-1} c    (9.9-10)

and the residual vector r is given by $r = P^T(0, d)^T$.

If the computations are carried out in the proper sequence, it can be
shown [Lawson and Hanson (1974)] that there is a matrix E and a vector e
such that

    P(A + E) = \begin{bmatrix} \bar{R} \\ 0 \end{bmatrix}        P(b + e) = \begin{bmatrix} \bar{c} \\ \bar{d} \end{bmatrix}

where $\bar{R}$, $\bar{c}$, and $\bar{d}$ are the computed values and where the norms of E and e
are small relative to those of A and b, respectively. In fact, if inner products
are computed using double-precision arithmetic, the elements of E and e are
of the order of the rounding error. In addition, the computed solution $\bar{y}$
satisfies $(\bar{R} + F)\bar{y} = \bar{c}$, where F is also small relative to $\bar{R}$. If we now set

    G = E + P^T \begin{bmatrix} F \\ 0 \end{bmatrix}

then, because P is orthogonal, G is also small and we have

    P(A + G) = \begin{bmatrix} \bar{R} + F \\ 0 \end{bmatrix}

Hence, $\bar{y}$ is the true solution to the least-squares problem of minimizing
$\|(A + G)y - (b + e)\|_2$. We thus have a stable algorithm. Furthermore, the
system Ry = c is better-conditioned than (9.9-3). In fact, if we use the
spectral condition number $K_2(B) = \|B\|_2 \|B^{-1}\|_2$, it is easy to see that
$K_2(A^T A) = K_2^2(R)$ {53}. Hence, the above method is recommended for the
solution of the general least-squares problem.

We note in passing that when A is an n × n matrix, i.e., when we have
the usual system of linear equations Ax = b, we have that


    PA = R        c = Pb        Rx = c        x = R^{-1}c


and we have a solution that is very stable and does not require pivoting.
However, this process is more costly than LU decomposition with
equilibration and partial pivoting {54} and since the latter scheme is stable for all
practical purposes, it is preferable.

It remains to show how to choose the $P_i$. It is easy to find $P_1$ so that the
first column of $A_2 = P_1 A_1$ has zeros below the diagonal where we have set
$A_1 = A$. Using the result above, we set x equal to the first column $a_1$ of $A_1$,
and $y = (\pm\|a_1\|_2, 0, ..., 0)^T$. With $v_1 = (x - y)/\|x - y\|_2$, $P_1 = I - 2v_1v_1^T$
does the trick, since the first column of $P_1A_1$ is then $P_1a_1 = P_1x = y$. For
computational purposes we choose the sign to be equal to $-\mathrm{sgn}(a_{11})$ so as
to reduce the roundoff error.

The problem arises with $P_2$. We must choose $P_2$ so that the subdiagonal
elements in the second column of $A_3 = P_2A_2$ vanish while at the same time
ensuring that the first column remains unchanged. To this end, we notice
that if we take v to be of the form $v = (0, v_2, ..., v_n)^T$, then the corresponding
P is of the form

    \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 0 & \times & \times & \cdots & \times \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & \times & \times & \cdots & \times \end{bmatrix}

and $PA_2$ will have the same first row and column as $A_2$. Hence, to get $P_2$ we
take $x = a_2$, $y = [a_{12}, \pm(\|a_2\|_2^2 - a_{12}^2)^{1/2}, 0, ..., 0]^T$ and with
$v_2 = (x - y)/\|x - y\|_2$, $P_2 = I - 2v_2v_2^T$.

In the kth step, we take $x = a_k$,

    y = \left[ a_{1k}, ..., a_{k-1,k}, \pm\left( \|a_k\|_2^2 - \sum_{i=1}^{k-1} a_{ik}^2 \right)^{1/2}, 0, ..., 0 \right]^T

$v_k = (x - y)/\|x - y\|_2$, and $P_k = I - 2v_kv_k^T$. In each case $a_k$ refers to the kth
column of the matrix $A_k$, with the sign chosen as $-\mathrm{sgn}(a_{kk})$.
After m steps, we have that $A_{m+1} = P_m \cdots P_1 A_1 = (R, 0)^T$.

Example 9.5 Solve the problem of Sec. 6.5 using the method of Householder
transformations.
In this problem we are given the empirical data 


10.3627 


and we wish to find the least-squares approximation by a polynomial of degree 4. To this
end, we set up the 9 × 5 matrix A with elements defined by $a_{ij} = x_i^{j-1}$ and the vector
b with $b_i = f_i$. After applying a sequence of five Householder transformations to (9.9-1), we
find that

        [ -3.00000  -1.50000   -.95000   -.67500   -.51110 ]
        [  0          .77460    .77460    .67235    .57010 ]
    R = [  0         0          .17550    .26325    .29208 ]
        [  0         0         0         -.03776   -.07551 ]
        [  0         0         0         0         -.00767 ]

    c = [-20.93590   4.91058   1.43284   -.18832   -.00766]^T

and d × 10° = [.4407   .3020   -.3129   -.3233]^T

Solving the system Ry = c by back substitution, we find the following values for the
coefficients of the polynomial $y_4(x) = \sum_{j=0}^{4} y_j x^j$:

    y_0 = 5.00097    y_1 = .99199    y_2 = 2.01720    y_3 = 2.98987    y_4 = .9988

which are almost identical with those given by (6.5-5).
If we apply the transformation $P^T$ to $(0, d)^T$, we get the vector r of residuals given by

    r × 10° = [-.0326   .1256   -.2328   .3600   -.4254   .1843   .1652   -.2037   .0594]^T
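
The whole procedure, triangularization by reflectors followed by back
substitution, fits in a short routine. The sketch below is an illustration
under stated assumptions, not the authors' program: the name
`householder_lstsq` is hypothetical, A is assumed to have linearly independent
columns, the reflector is applied in the unnormalized form $I - 2vv^T/(v^Tv)$, and
the small straight-line data set is invented for the demonstration (the data of
Sec. 6.5 are not reproduced here).

```python
import math

def householder_lstsq(A, b):
    """Least-squares solution of Ay = b by Householder triangularization.

    Returns the coefficient vector y and the Euclidean norm of the residual.
    Assumes len(A) >= len(A[0]) and linearly independent columns.
    """
    n, m = len(A), len(A[0])
    A = [[float(v) for v in row] for row in A]     # work on copies
    b = [float(v) for v in b]
    for k in range(m):
        norm = math.sqrt(sum(A[i][k] ** 2 for i in range(k, n)))
        if norm == 0.0:
            raise ValueError("columns are linearly dependent")
        alpha = -math.copysign(norm, A[k][k])      # sign chosen as -sgn(a_kk)
        v = [0.0] * n                              # v = x - y, zero above row k
        v[k] = A[k][k] - alpha
        for i in range(k + 1, n):
            v[i] = A[i][k]
        vtv = sum(t * t for t in v)
        for j in range(k, m):                      # apply the reflector to A
            s = 2.0 * sum(v[i] * A[i][j] for i in range(k, n)) / vtv
            for i in range(k, n):
                A[i][j] -= s * v[i]
        s = 2.0 * sum(v[i] * b[i] for i in range(k, n)) / vtv
        for i in range(k, n):                      # and to the right-hand side
            b[i] -= s * v[i]
    y = [0.0] * m                                  # back substitution, Ry = c
    for i in range(m - 1, -1, -1):
        y[i] = (b[i] - sum(A[i][j] * y[j] for j in range(i + 1, m))) / A[i][i]
    return y, math.sqrt(sum(t * t for t in b[m:]))  # ||r||_2 = ||d||_2

# Invented data: fit y_0 + y_1 x to four points.
xs = [0.0, 1.0, 2.0, 3.0]
fs = [1.0, 3.0, 5.0, 8.0]
A = [[1.0, x] for x in xs]
coeffs, resid_norm = householder_lstsq(A, fs)
print(coeffs, resid_norm)
```

The matrix P is never formed; only the vectors $v_k$ are applied, which is the
usual way to keep the cost at the level of the triangularization itself.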


We now proceed to the $L_1$ and $L_\infty$ cases. To do this, we must first
discuss the linear programming (LP) problem. A quite general formulation
of this problem is as follows.

Find a vector $x = (x_1, x_2, ..., x_t)$ which minimizes the linear function

    c^T x = c_1 x_1 + c_2 x_2 + \cdots + c_t x_t    (9.9-11)

subject to the constraints†

    a_i^T x = b_i       i = 1, ..., s_1
    a_i^T x \le b_i     i = s_1 + 1, ..., s    (9.9-12)
    x_j \ge 0           j = t_1, t_1 + 1, ..., t

where the $a_i$ are vectors of dimension t. In the next section, on the simplex
method, we shall show how to solve this problem. Here we are concerned
with the question of how to convert the $L_1$ and $L_\infty$ minimization problems
to LP problems.

† If we have a constraint of the form $a_i^T x \ge b_i$, we multiply through by -1 to get the
inequality $-a_i^T x \le -b_i$, which fits into the framework of (9.9-12).

Consider first the $L_\infty$ case. We are seeking a vector $y = (y_1, ..., y_m)^T$
such that $\|r\|_\infty = \|b - Ay\|_\infty = \max_{1 \le j \le n} |b_j - A_j y|$ = minimum, where
$A_j$ denotes the jth row of A. To this end, we introduce a new variable $y_{m+1}$
which will be equal to $\max_j |b_j - A_j y|$. Then, for every j we have that

    -y_{m+1} \le b_j - A_j y \le y_{m+1}    j = 1, ..., n

Since we wish to minimize this maximum value, our problem reduces to the
LP problem: find a vector $\hat{y} = (y_1, ..., y_{m+1})^T$ which minimizes the linear
function

    c^T \hat{y} = y_{m+1}

subject to the constraints

    A_j y - y_{m+1} \le b_j
    -A_j y - y_{m+1} \le -b_j    j = 1, ..., n    (9.9-13)
    y_{m+1} \ge 0

which is of the form (9.9-11) and (9.9-12) with $c = (0, ..., 0, 1)^T$, $s_1 = 0$,
$t_1 = t = m + 1$, s = 2n, and

    a_i = (A_i, -1)          i = 1, ..., n
    a_i = (-A_{i-n}, -1)     i = n + 1, ..., s

The $L_1$ case can be treated in a similar fashion. We are seeking a vector
$y = (y_1, ..., y_m)^T$ such that

    \|r\|_1 = \|b - Ay\|_1 = \sum_{j=1}^{n} |b_j - A_j y| = minimum

Now the residual $r_j = b_j - A_j y$ can be expressed as the difference, p - q,
of two nonnegative numbers p and q in many ways. If we consider the sum
p + q, we see that $p + q = |r_j|$ if and only if $p = r_j$, q = 0 or p = 0, $q = -r_j$.
Furthermore, the minimum value of p + q subject to the constraints $p \ge 0$,
$q \ge 0$ occurs when either p = 0 or q = 0. Therefore, for each j we introduce
two new nonnegative variables $y_{m+j}$ and $y_{m+n+j}$ such that

    b_j - A_j y = y_{m+j} - y_{m+n+j}    j = 1, ..., n

Our problem then is to find a vector $\hat{y} = (y_1, ..., y_{m+2n})^T$ which minimizes the
linear function

    \sum_{k=m+1}^{m+2n} y_k    (9.9-14)

subject to the constraints

    A_j y + y_{m+j} - y_{m+n+j} = b_j    j = 1, ..., n
    y_k \ge 0                            k = m + 1, ..., m + 2n    (9.9-15)

This problem is clearly of the form (9.9-11) and (9.9-12) {57}. The minimum
is attained with either $y_{m+j}$ or $y_{m+n+j}$ (or both) equal to zero for each j. In
this case, (9.9-14) is equal to $\sum_{j=1}^{n} |b_j - A_j y|$, which is minimal.
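
The two conversions amount to bookkeeping on the rows of A. The sketch below is
an illustration only: the function names `linf_as_lp` and `l1_as_lp`, the return
layout, and the one-column test problem are assumptions. Each function returns
the cost vector, the constraint rows, and the right-hand sides; the $L_\infty$ rows
are inequalities of the form (9.9-13) and the $L_1$ rows are the equalities
(9.9-15). No LP solver is invoked.

```python
def linf_as_lp(A, b):
    """Rows and right-hand sides of (9.9-13): minimize c.z with rows[i].z <= rhs[i].

    The unknown vector is z = (y_1, ..., y_m, y_{m+1}).
    """
    n, m = len(A), len(A[0])
    c = [0.0] * m + [1.0]
    rows = [list(A[j]) + [-1.0] for j in range(n)]             #  A_j y - y_{m+1} <=  b_j
    rows += [[-a for a in A[j]] + [-1.0] for j in range(n)]    # -A_j y - y_{m+1} <= -b_j
    rhs = [b[j] for j in range(n)] + [-b[j] for j in range(n)]
    return c, rows, rhs

def l1_as_lp(A, b):
    """Rows and right-hand sides of (9.9-15): minimize c.z with rows[j].z == rhs[j].

    The unknown vector is z = (y_1..y_m, y_{m+1}..y_{m+n}, y_{m+n+1}..y_{m+2n}).
    """
    n, m = len(A), len(A[0])
    c = [0.0] * m + [1.0] * (2 * n)
    rows = []
    for j in range(n):
        plus = [1.0 if i == j else 0.0 for i in range(n)]
        minus = [-1.0 if i == j else 0.0 for i in range(n)]
        rows.append(list(A[j]) + plus + minus)
    return c, rows, list(b)

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

# Assumed test problem: fit a single constant to the values 1, 2, 4.
A = [[1.0], [1.0], [1.0]]
b = [1.0, 2.0, 4.0]

c_inf, rows_inf, rhs_inf = linf_as_lp(A, b)
z = [2.5, 1.5]                       # y = 2.5 with maximum deviation 1.5
print(all(dot(r, z) <= t for r, t in zip(rows_inf, rhs_inf)))

c_1, rows_1, rhs_1 = l1_as_lp(A, b)
z1 = [2.0, 0.0, 0.0, 2.0, 1.0, 0.0, 0.0]   # y = 2, residuals split into p - q
print([dot(r, z1) for r in rows_1], dot(c_1, z1))
```

In the trial points, only the nonnegativity conditions of (9.9-13) and (9.9-15)
are left to the reader to verify by inspection.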


9.10 THE SIMPLEX METHOD FOR SOLVING 
LINEAR PROGRAMMING PROBLEMS 


As we have seen in the previous section, the problem of finding the solution
of an overdetermined system of linear equations which minimizes the $L_1$ or
$L_\infty$ norm of the error vector can be formulated as an LP problem. This is
only one example of many in which LP can be applied to numerical analysis
[Rabinowitz (1968)]. However, the principal applications of LP have been in
industry, where it is used to determine the optimum allocation of resources
subject to various constraints imposed by a given situation. Because of the
economic importance of LP, much effort has been invested in building
sophisticated systems for solving large LP problems‡ on computers, and many
algorithms have been developed to this end. Among these algorithms, the
simplex algorithm, developed by Dantzig and his coworkers in the late 
forties, is still the one in most prevalent use, in its original form or in one of 
its many modifications. While there are deficiencies in the simplex algo- 
rithm, such as its possible numerical instability, nevertheless, for all practical 
purposes, it has proved itself over the years and is perfectly adequate for the 
solution of LP problems arising in numerical analysis. We shall discuss here 
only the basic simplex method and refer the reader to the literature for 
further details on solving LP problems. 

The simplex method assumes that the LP problem has been formulated
in the following standard form: find $x = (x_1, x_2, ..., x_n)^T$ which minimizes
the objective or cost function

    f(x) = c^T x = c_1 x_1 + c_2 x_2 + \cdots + c_n x_n    (9.10-1)

subject to the constraints

    Ax = b \ge 0    (9.10-2)
    x \ge 0         (9.10-3)

where A is an m × n matrix, m < n, and b is a nonnegative vector of
dimension m.

‡ Problems involving tens of thousands of variables and thousands of constraints have been
successfully solved with some of these systems.

The more general formulation of the previous section can be converted
to standard form by the following devices. For any inequality

    a_{i1} x_1 + \cdots + a_{in} x_n \le b_i    (9.10-4)

introduce a slack variable $x_s \ge 0$ which converts (9.10-4) to an equation

    a_{i1} x_1 + \cdots + a_{in} x_n + x_s = b_i

To this variable assign a cost coefficient $c_s = 0$. Then replace a variable $x_k$
which is unrestricted in sign by the difference between two nonnegative
variables

    x_k = x_{k1} - x_{k2}        x_{k1}, x_{k2} \ge 0

Finally, multiply any equation in which the right-hand side is negative by
-1 to yield a nonnegative b.

Our standard problem may thus have many more variables than the 
original problem but the same number of rows. Fortunately, it is the number 
of rows which measures the amount of work required in the solution. 

In some problems, one may wish to maximize the objective function
rather than minimize it. This is readily accomplished since


    \max c^T x = -\min (-c^T x)


and the solution which minimizes $-c^T x$ will maximize $c^T x$.

Returning now to our standard form, we notice that (9.10-2) is an
underdetermined system of linear equations. If the rank of A is m, which we shall
henceforth assume to be the case, then (9.10-2) has an infinite number of
solutions in which n - m of the variables may vary freely. Suppose, for
example, the first m columns of A are linearly independent; then partitioning
A into a square matrix $A_{m,m}$ formed from the first m columns of A and a
matrix $A_{m,n-m}$, we premultiply both sides of (9.10-2) by $A_{m,m}^{-1}$ to obtain

    x_1 = \bar{b}_1 - \bar{a}_{1,m+1} x_{m+1} - \cdots - \bar{a}_{1,n} x_n
    \cdots                                                                (9.10-5)
    x_m = \bar{b}_m - \bar{a}_{m,m+1} x_{m+1} - \cdots - \bar{a}_{m,n} x_n

where the overbars indicate the result of premultiplying $A_{m,n-m}$ and b by
$A_{m,m}^{-1}$. (As we shall see below, we do not have to calculate $A_{m,m}^{-1}$ explicitly.)
Thus, for any choice of values $x_{m+1}^*, ..., x_n^*$ the vector $x^* = (x_1^*, ..., x_n^*)$
is a solution of (9.10-2), where $x_1^*, ..., x_m^*$ are calculated by substituting $x_{m+1}^*$,
..., $x_n^*$ into (9.10-5). In particular the choice $x_{m+1} = x_{m+2} = \cdots = x_n = 0$
yields a basic solution $x = (\bar{b}_1, ..., \bar{b}_m, 0, ..., 0)$. For every subset of m
linearly independent columns of A, we can similarly find a basic solution.
We now introduce the constraints (9.10-3) and call a vector x feasible if x
satisfies (9.10-2) and (9.10-3). A basic solution of (9.10-2) which satisfies
(9.10-3) is naturally called a basic feasible solution. A basic feasible solution
in which all m nonzero components are positive is called a nondegenerate
basic feasible solution. We shall assume for the present that all basic feasible
solutions are nondegenerate.

Let us assume that we have found a general solution (9.10-5) in which all
the $\bar{b}_i > 0$. Then for any nonnegative values of $x_{m+1}, ..., x_n$ for which $x_1, ...,
x_m$ remain nonnegative, we have a feasible solution $x = (x_1, x_2, ..., x_n)^T$.
Furthermore, for $x_{m+1} = x_{m+2} = \cdots = x_n = 0$ we have a nondegenerate
basic feasible solution. Substituting (9.10-5) into the objective function
(9.10-1) eliminates the variables $x_1, ..., x_m$ from (9.10-1). We obtain

    f(x) = z_0 + \bar{c}_{m+1} x_{m+1} + \cdots + \bar{c}_n x_n    (9.10-6)

where

    z_0 = c_1 \bar{b}_1 + c_2 \bar{b}_2 + \cdots + c_m \bar{b}_m    (9.10-7)

and where the $\bar{c}_j$, j = m + 1, ..., n, are called the reduced cost coefficients.
For the basic feasible solution $x^* = (\bar{b}_1, ..., \bar{b}_m, 0, ..., 0)$ the value of the
objective function $f(x^*) = z_0$. Furthermore, if all the reduced cost
coefficients are nonnegative, any change in the values of the independent
variables $x_{m+1}, ..., x_n$ which yields a feasible solution cannot decrease f(x).
Hence f(x) has a local minimum at $x^*$, which can be shown to be a global
minimum over the set of all feasible solutions {59}. Conversely, if some
$\bar{c}_k < 0$, then, as we shall see, we can increase $x_k$ to yield a feasible solution x,
in fact a basic feasible solution, with $f(x) < f(x^*)$ so that $x^*$ is not an
optimal solution. Thus, we have that a necessary and sufficient condition for
optimality of a basic feasible solution is that the reduced cost coefficients $\bar{c}_j$
be nonnegative. Furthermore, if an optimal solution exists, there exists a
basic feasible solution which is also optimal {60}.

Assume now that in (9.10-6) some $\bar{c}_k < 0$. Then we can increase $x_k$ and
so decrease f(x), and the more we increase $x_k$, the greater the decrease in f(x).
What limits us in increasing $x_k$? From (9.10-5) we see that a change in $x_k$ will
cause $x_1, ..., x_m$ to change. If we change $x_k$ so much that one of the variables
$x_i$, i = 1, ..., m, changes sign, we shall no longer have a feasible solution.
Thus, we increase $x_k$ by the maximum amount consistent with feasibility. If
for a particular variable $x_p$, $1 \le p \le m$, $\bar{a}_{pk} \le 0$, then any increase in $x_k$ will
not decrease $x_p$, so that $x_p$ will always remain positive. For $\bar{a}_{pk} > 0$, $x_p$ will
remain nonnegative provided that $x_k \le \bar{b}_p/\bar{a}_{pk}$. Hence all $x_i$, $1 \le i \le m$, will
remain nonnegative (and $x_i$, $m + 1 \le i \le n$, $i \ne k$, will remain 0) as long as

    x_k \le \min_{\bar{a}_{pk} > 0} \frac{\bar{b}_p}{\bar{a}_{pk}} = \frac{\bar{b}_q}{\bar{a}_{qk}}    (9.10-8)

If we set $x_k = \bar{b}_q/\bar{a}_{qk} > 0$, we see that $x_q = 0$ and we have a new basic feasible
solution

    [\bar{b}_1', ..., \bar{b}_{q-1}', 0, \bar{b}_{q+1}', ..., \bar{b}_m', 0, ..., 0, x_k, 0, ..., 0]^T

Since we have assumed that all basic feasible solutions are nondegenerate,
there is a unique q such that $x_k = \bar{b}_q/\bar{a}_{qk}$; otherwise one of the $\bar{b}_i'$ would
vanish. Before we show how to express this new solution in form (9.10-5), let
us see what happens if all the coefficients $\bar{a}_{pk}$ are nonpositive. In this case, we
can increase $x_k$ without bound, and the vector

    x' = [\bar{b}_1', ..., \bar{b}_m', 0, ..., 0, x_k, 0, ..., 0]^T

always remains a feasible solution. At the same time, the objective function
f(x') decreases to $-\infty$; that is, we have an unbounded solution. This usually
indicates that the LP problem has not been formulated properly and that
probably some constraint has been omitted. At any rate, the combination
$\bar{c}_k < 0$ and $\bar{a}_{pk} \le 0$, p = 1, ..., m, terminates the algorithm with an indication
that the solution is unbounded.

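We now show how to write the new solution in form (9.10-5), which will
then allow us to start a new iteration. With $q$ defined by (9.10-8) we rewrite
row $q$ of (9.10-5) as

$$x_k = \frac{1}{\bar{a}_{qk}} \left( \bar{b}_q - \bar{a}_{q,m+1} x_{m+1} - \cdots - \bar{a}_{q,k-1} x_{k-1} - \bar{a}_{q,k+1} x_{k+1} - \cdots - \bar{a}_{qn} x_n - x_q \right)$$ (9.10-9)

and substitute this value of $x_k$ into the other equations of (9.10-5) and into
(9.10-6). This yields the new system

$$x_i = \bar{b}_i' - \bar{a}_{i,m+1}' x_{m+1} - \cdots - \bar{a}_{i,k-1}' x_{k-1} - \bar{a}_{i,k+1}' x_{k+1} - \cdots - \bar{a}_{in}' x_n - \bar{a}_{iq}' x_q \qquad i = 1, \ldots, q-1, q+1, \ldots, m, k$$ (9.10-10)

and the new objective function

$$f(x) = z_0' + \bar{c}_{m+1}' x_{m+1} + \cdots + \bar{c}_{k-1}' x_{k-1} + \bar{c}_{k+1}' x_{k+1} + \cdots + \bar{c}_n' x_n + \bar{c}_q' x_q$$

where

$$\bar{a}_{ij}' = \bar{a}_{ij} - \frac{\bar{a}_{ik} \bar{a}_{qj}}{\bar{a}_{qk}} \qquad \bar{b}_i' = \bar{b}_i - \frac{\bar{a}_{ik} \bar{b}_q}{\bar{a}_{qk}} \qquad i = 1, \ldots, m;\ i \ne q \qquad j = m+1, \ldots, n;\ j \ne k$$

$$\bar{c}_j' = \bar{c}_j - \frac{\bar{c}_k \bar{a}_{qj}}{\bar{a}_{qk}} \qquad \bar{a}_{iq}' = -\frac{\bar{a}_{ik}}{\bar{a}_{qk}} \qquad \bar{a}_{kj}' = \frac{\bar{a}_{qj}}{\bar{a}_{qk}} \qquad \bar{a}_{kq}' = \frac{1}{\bar{a}_{qk}}$$ (9.10-11)

$$\bar{c}_q' = -\frac{\bar{c}_k}{\bar{a}_{qk}} \qquad \bar{b}_k' = \frac{\bar{b}_q}{\bar{a}_{qk}} \qquad z_0' = z_0 + \frac{\bar{c}_k \bar{b}_q}{\bar{a}_{qk}}$$

Since $\bar{c}_k < 0$, $\bar{b}_q > 0$, and $\bar{a}_{qk} > 0$, we see that $z_0' < z_0$, so that each iteration
decreases the value of the objective function.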

We shall now give the algorithm for a general step, at the same time
simplifying the notation.

Let $N = \{1, 2, \ldots, n\}$ be partitioned into two sets, $N = D \cup I$, where
$D = \{P_1, \ldots, P_m\}$ is the set of indices of the dependent or basic variables and
$I = \{P_{m+1}, \ldots, P_n\}$ is the set of indices of the independent variables. In the
following, any index $i$ used refers to $P_i$, $i = 1, \ldots, n$. Thus $x_1$ refers to $x_{P_1}$, a
basic variable, etc., and we can describe the initial state of each iteration by
(9.10-5) and (9.10-6). We now simplify the notation a little by writing $\bar{a}_{i0}$ in
place of $\bar{b}_i$, $\bar{a}_{0j}$ in place of $\bar{c}_j$, $\bar{a}_{00}$ in place of $-z_0$, and $x_0$ in place of $-f(x)$.
Thus (9.10-5) and (9.10-6) are combined into the system

$$x_i = \bar{a}_{i0} - \sum_{j=m+1}^{n} \bar{a}_{ij} x_j \qquad i = 0, \ldots, m$$ (9.10-12)

The matrix $\bar{A} = [\bar{a}_{ij}]$, $i = 0, \ldots, m$, $j = m + 1, \ldots, n$, together with the vector
$\bar{a}_{i0}$ and the permutation vectors $D$ and $I$ form what is called the tableau (see
Fig. 9.1). The simplex method proceeds as follows:


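1. Compute $\min_{m+1 \le j \le n} \bar{a}_{0j} = \bar{a}_{0k}$ (to yield the most negative $\bar{c}_k$). In
case of ties, choose the first index $k$. If $\bar{a}_{0k} \ge 0$, terminate with the basic
solution $x$ with positive components $x_{P_i} = \bar{a}_{i0}$, $i = 1, \ldots, m$, and zero
components $x_{P_{m+1}}, \ldots, x_{P_n}$. (Recall our assumption of nondegeneracy.)
The optimal value of the objective function is $-\bar{a}_{00}$.

2. If $\bar{a}_{0k} < 0$, compute $\min_{\bar{a}_{ik} > 0} (\bar{a}_{i0}/\bar{a}_{ik}) = \bar{a}_{q0}/\bar{a}_{qk}$ [cf. (9.10-8)]. If all $\bar{a}_{ik} \le 0$,
terminate with an indication that the solution is unbounded.

3. Introduce $P_k$ into $D$ in place of $P_q$ to yield $D'$ and replace $P_k$ by $P_q$ in $I$ to
yield $I'$. Corresponding to (9.10-11), compute

$$\bar{a}_{ij}' = \bar{a}_{ij} - \frac{\bar{a}_{ik} \bar{a}_{qj}}{\bar{a}_{qk}} \qquad i = 0, \ldots, m;\ i \ne q \qquad j = 0, m+1, \ldots, n;\ j \ne k$$

$$\bar{a}_{ik}' = -\frac{\bar{a}_{ik}}{\bar{a}_{qk}} \qquad \bar{a}_{qj}' = \frac{\bar{a}_{qj}}{\bar{a}_{qk}} \qquad \bar{a}_{qk}' = \frac{1}{\bar{a}_{qk}}$$ (9.10-13)

and return to step 1 with an updated matrix $\bar{A}'$, vector $\bar{a}_{i0}'$, and permutation
vectors $D'$ and $I'$.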
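Steps 1 to 3 can be sketched in a few lines of Python. This is a minimal illustration, not a production code: the function name `simplex`, the storage of column 0 for the $\bar{a}_{i0}$, and the small LP used to exercise it are our own assumptions, and exact rational arithmetic is used so that no roundoff intrudes. Nondegeneracy is assumed, as in the text.

```python
from fractions import Fraction

def simplex(T, D, I):
    """Apply steps 1-3 to the tableau T in place.

    Row i of T reads x_i = T[i][0] - sum_c T[i][c] * x_{I[c-1]}, with row 0
    the objective row; D and I list the basic and independent indices.
    Returns "optimal" or "unbounded".
    """
    m, width = len(D), len(T[0])
    while True:
        # Step 1: most negative entry of row 0; min() keeps the first on ties.
        k = min(range(1, width), key=lambda c: T[0][c])
        if T[0][k] >= 0:
            return "optimal"
        # Step 2: ratio test over the rows with a positive entry in column k.
        rows = [i for i in range(1, m + 1) if T[i][k] > 0]
        if not rows:
            return "unbounded"
        q = min(rows, key=lambda i: T[i][0] / T[i][k])
        # Step 3: exchange P_k and P_q, then update as in (9.10-13).
        D[q - 1], I[k - 1] = I[k - 1], D[q - 1]
        pivot = T[q][k]
        for i in range(m + 1):
            if i == q:
                continue
            factor = T[i][k] / pivot
            for c in range(width):
                if c != k:
                    T[i][c] -= factor * T[q][c]
            T[i][k] = -factor
        for c in range(width):
            if c != k:
                T[q][c] /= pivot
        T[q][k] = 1 / pivot

# Hypothetical LP: minimize -x1 - 2*x2 subject to
#   x1 + x2 + x3 = 4,  x1 + 3*x2 + x4 = 6,  all x_i >= 0,
# started from the basic feasible solution x3 = 4, x4 = 6.
T = [[Fraction(v) for v in row] for row in ([0, -1, -2], [4, 1, 1], [6, 1, 3])]
D, I = [3, 4], [1, 2]
status = simplex(T, D, I)
solution = {var: T[r + 1][0] for r, var in enumerate(D)}
print(status, solution, "minimum =", -T[0][0])
```

The tableau is overwritten in place, so after termination the basic variables listed in `D` take the values in column 0 and the optimal value is `-T[0][0]`.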


This algorithm is guaranteed to terminate in either step 1 or 2, since
each iteration reduces the value of $-\bar{a}_{00}$, and hence any particular partition
$(I, D)$ can appear only once, while the number of distinct partitions of $N$ of
the form $N = I \cup D$ equals $\binom{n}{m}$. In practice the number of iterations
required to terminate is of the order of $2m$.

Figure 9.1 The tableau of a linear programming problem.

There remains the problem how to start the process, namely how to get 
a representation in form (9.10-5). That this is not always possible can be 
seen from the following trivial example. 


Minimize $x_1 + x_2$ subject to $-x_1 - x_2 = 1$; $x_1, x_2 \ge 0$ (9.10-14)


This is a case where there is no feasible solution to the problem. Hence, we 
must find an algorithm to find a basic feasible solution if one exists and to 
terminate with an indication of infeasibility if there are no feasible solutions. 
Fortunately, we can use the same simplex algorithm applied to a related 
problem. By doing so we shall find a basic feasible solution for our problem 
and, at the same time, formulate the problem in the form (9.10-5), without 
having actually computed $A_m^{-1}$ and premultiplied $A$ by it.

Let us introduce a vector $y = (y_1, \ldots, y_m)^T$ of so-called artificial
variables, and consider the LP problem

Minimize $\sum_{i=1}^{m} y_i$ (9.10-15)
subject to the constraints
$Ax + I_m y = b$ (9.10-16)
$x, y \ge 0$ (9.10-17)


where $I_m$ is the unit $m \times m$ matrix. This problem has a basic feasible solution
$y = b$, $x = 0$, so that we can apply the simplex algorithm. Since the objective
function is bounded below by 0, there cannot be an unbounded solution, so
that the simplex algorithm must terminate with an optimal solution. Now if
in this solution any component of $y$ is positive, it means that there is no
feasible solution to the problem (9.10-1) to (9.10-3). For if a feasible solution
$x$ existed, then $(x, y)$ with $y = 0$ would be a feasible solution of (9.10-15) to
(9.10-17) with the value of the objective function (9.10-15) equal to 0. Since
the algorithm terminated with some $y_i > 0$, so that the value of
(9.10-15) is positive, this means that a solution of the form $(x, y)$ with $y = 0$ to
(9.10-15) to (9.10-17) does not exist; i.e., no feasible solution exists for the
problem (9.10-1) to (9.10-3). If every component of $y$ vanishes, we have an
initial basic feasible solution consisting of the $m$ positive components of $x$ in
the optimal solution. (Recall our assumption of nondegeneracy.) With this
solution we can compute the reduced cost coefficients and thus set up our
tableau and apply the simplex algorithm until termination. Moreover, the
solution of (9.10-15) to (9.10-17) will automatically result in a tableau with
Eqs. (9.10-2) in the form (9.10-5). (See also Example 9.6, below.) Thus we
have two phases in our problem, phase I, in which we find a feasible solution
if one exists, and phase II, in which we find an optimal solution or determine
unboundedness.
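Setting up phase I is mechanical, as the following sketch suggests. The function name `phase_one_tableau`, the numbering of the artificial variables as $x_{n+1}, \ldots, x_{n+m}$, and the tableau layout (column 0 holding the constants, row 0 the objective) are assumptions made for this illustration; we also assume $b \ge 0$, so that $y = b$ is feasible. The data are those of Example 9.6.

```python
from fractions import Fraction

def phase_one_tableau(A, b):
    """Build the starting tableau for (9.10-15) to (9.10-17).

    Variables x_1..x_n are independent; the artificial variables are given
    the (assumed) indices n+1..n+m and form the initial basis y = b.
    Row i reads y_i = b_i - sum_j a_ij * x_j. Row 0 holds minus the column
    sums, since sum(y) = sum(b) - sum_j (sum_i a_ij) * x_j.
    """
    m, n = len(A), len(A[0])
    rows = [[Fraction(b[i])] + [Fraction(v) for v in A[i]] for i in range(m)]
    objective = [-sum(col) for col in zip(*rows)]
    D = list(range(n + 1, n + m + 1))   # basic (artificial) variables
    I = list(range(1, n + 1))           # independent (original) variables
    return [objective] + rows, D, I

A = [[5, -4, 13, -2, 1],
     [1, -1, 5, -1, 1]]
b = [20, 8]
T, D, I = phase_one_tableau(A, b)
print(T[0], D, I)
```

The most negative entry of row 0 belongs to $x_3$, which is therefore the first variable to enter the basis under the rule of step 1.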

We conclude with a word about degeneracy. There are two aspects to 
degeneracy, a theoretical side and a practical side. In theory, the simplex 
method need not terminate in the presence of degeneracy and may cycle. 
However, a perturbation argument [see Dantzig (1963)] shows that when
several rows in step 2 of the simplex method attain the minimum of the
ratios $\bar{a}_{i0}/\bar{a}_{ik}$, thus leading to degeneracy, a choice procedure
exists for choosing the proper row so as to avoid cycling. The
simplex method so modified will terminate even in the presence of degener- 
acy. In practice, degeneracy is just ignored, and in step 2 an arbitrary choice 
is made in the case of a tie. There have been almost no cases of cycling 
reported in practical problems, even though degeneracy is very common. 


Example 9.6 Solve the following LP problem using the simplex method.
Minimize $f(x) = x_1 + 6x_2 - 7x_3 + x_4 + 5x_5$ (9.10-18)
subject to the constraints

$$\begin{aligned} 5x_1 - 4x_2 + 13x_3 - 2x_4 + x_5 &= 20 \\ x_1 - x_2 + 5x_3 - x_4 + x_5 &= 8 \end{aligned}$$ (9.10-19)

all $x_i \ge 0$

Since there is no obvious initial basic feasible solution, we shall apply phase I to find
one. To this end we set up the following problem which we solve first:

minimize $g(x) = x_6 + x_7$ (9.10-20)
subject to the constraints

$$\begin{aligned} 5x_1 - 4x_2 + 13x_3 - 2x_4 + x_5 + x_6 &= 20 \\ x_1 - x_2 + 5x_3 - x_4 + x_5 + x_7 &= 8 \end{aligned}$$ (9.10-21)

all $x_i \ge 0$

The initial basic feasible solution to this problem is $x_6 = 20$, $x_7 = 8$. The corresponding
tableau is


where we shall circle the pivot element in each tableau. This element is determined by the
rules of the simplex method. Applying the transformations (9.10-13), we get the following
two tableaus:


Since all elements $\bar{a}_{0j} \ge 0$, $j \ge 1$, we have reached an optimal solution of phase I. This
solution, $x_3 = \frac{3}{2}$, $x_5 = \frac{1}{2}$, is a basic feasible solution of (9.10-19), as we readily verify. Note
that the final reduced cost coefficients are precisely the original coefficients of (9.10-15)
because this last tableau corresponds to a premultiplication of (9.10-16) by $A_m^{-1}$, which
must leave (9.10-15) unchanged since the solution variables have 0 cost coefficients in the
objective function.

We now are ready to start phase II and return to (9.10-18) and (9.10-19) to set up the
following tableau:


All the rows but the first arise from the final tableau of phase I by deleting the columns
corresponding to the artificial variables $x_6$ and $x_7$. To get the first row, we insert the
values of $x_3$ and $x_5$ into (9.10-18), where, from the tableau,

$$\begin{aligned} x_3 &= \tfrac{3}{2} - \tfrac{1}{2}x_1 + \tfrac{3}{8}x_2 + \tfrac{1}{8}x_4 \\ x_5 &= \tfrac{1}{2} + \tfrac{3}{2}x_1 - \tfrac{7}{8}x_2 + \tfrac{3}{8}x_4 \end{aligned}$$ (9.10-22)

so that (9.10-18) becomes

$$f(x) = -8 + 12x_1 - x_2 + 2x_4$$ (9.10-23)

The next (and final) tableau follows by applying the simplex method as before


From this, we read off the solution $x_2 = \frac{4}{7}$, $x_3 = \frac{12}{7}$, $x_j = 0$, $j = 1, 4, 5$. For this solution, the
value of $f(x)$ is $-\frac{60}{7}$.
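The arithmetic of this example is easily checked in exact rational arithmetic. The sketch below is ours, not part of the example: the helper names are hypothetical, and the coefficients in `from_independent` are those obtained by solving (9.10-19) for $x_3$ and $x_5$. It confirms that these expressions satisfy the constraints identically, that $f$ then reduces to (9.10-23), and that raising $x_2$ to $\frac{4}{7}$ drives $x_5$ to zero.

```python
from fractions import Fraction as F

def f(x):
    """Objective (9.10-18); x maps the index i to the value of x_i."""
    return x[1] + 6 * x[2] - 7 * x[3] + x[4] + 5 * x[5]

def constraints(x):
    """Left-hand sides of the two equations (9.10-19)."""
    return (5 * x[1] - 4 * x[2] + 13 * x[3] - 2 * x[4] + x[5],
            x[1] - x[2] + 5 * x[3] - x[4] + x[5])

def from_independent(x1, x2, x4):
    """Recover x_3 and x_5 from the independent variables, as in (9.10-22)."""
    x3 = F(3, 2) - F(1, 2) * x1 + F(3, 8) * x2 + F(1, 8) * x4
    x5 = F(1, 2) + F(3, 2) * x1 - F(7, 8) * x2 + F(3, 8) * x4
    return {1: x1, 2: x2, 3: x3, 4: x4, 5: x5}

# The constraints hold identically, and f reduces to (9.10-23).
for x1, x2, x4 in [(F(0), F(0), F(0)), (F(1), F(2), F(3)), (F(2, 3), F(1, 5), F(7))]:
    x = from_independent(x1, x2, x4)
    assert constraints(x) == (20, 8)
    assert f(x) == -8 + 12 * x1 - x2 + 2 * x4

# Only x_2 has a negative reduced cost; the ratio test allows x_2 = 4/7.
optimum = from_independent(F(0), F(4, 7), F(0))
print(optimum[3], optimum[5], f(optimum))
```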
9.11 MISCELLANEOUS TOPICS


Complex matrices The most efficient way to solve the system of linear
equations with complex elements

$Cz = w$ (9.11-1)
with $C = A + iB$, $z = x + iy$, $w = u + iv$ (9.11-2)


is to follow the real algorithm, replacing all real operations by complex ones. 
This can readily be accomplished in a programming language which allows 
for the declaration of complex variables. In the absence of such a facility, it 
can become tiresome to convert every real operation into a series of such 
operations. In this case, one may be willing to pay the cost and convert the 
problem to one involving real numbers. Inserting (9.11-2) into (9.11-1) and 
equating real and imaginary parts of both sides, we get the real system {62} 


$Ax - By = u \qquad Bx + Ay = v$ (9.11-3)


This is a $2n \times 2n$ system requiring $O(4n^2)$ storage locations, as against
$O(2n^2)$ for the complex system. Similarly, the amount of computation is
$O(8n^3/3)$ multiplications as against $O(4n^3/3)$ real multiplications for the
complex case. Thus, if complex systems arise infrequently, it may not pay to 
invest the programming effort in writing a special program for complex 
matrices. However, if they appear frequently and the language in use does 
not accommodate complex variables, the superiority of the complex formu- 
lation in both storage requirements and computation time calls for writing a 
separate program for complex systems. 
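The equivalence of (9.11-1) and (9.11-3) can be illustrated with a short sketch. The elimination routine `solve` and the $2 \times 2$ test matrix are our own choices for the illustration; because Python's arithmetic operators and `abs` accept complex operands, the same routine serves for the complex system and for the real system of twice the order.

```python
def solve(M, rhs):
    """Gaussian elimination with partial pivoting; entries may be float or complex."""
    n = len(M)
    a = [list(row) + [r] for row, r in zip(M, rhs)]   # augmented matrix
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        a[k], a[p] = a[p], a[k]
        for i in range(k + 1, n):
            factor = a[i][k] / a[k][k]
            for j in range(k, n + 1):
                a[i][j] -= factor * a[k][j]
    x = [0] * n
    for i in reversed(range(n)):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (a[i][n] - s) / a[i][i]
    return x

# Hypothetical test data, chosen only so that C is nonsingular.
C = [[2 + 1j, 1 - 1j], [1 + 3j, 4 - 2j]]
w = [1 + 2j, 3 - 1j]
n = len(C)

A = [[c.real for c in row] for row in C]
B = [[c.imag for c in row] for row in C]
# Real system (9.11-3): [A -B; B A] [x; y] = [u; v]
real_matrix = ([A[i] + [-v for v in B[i]] for i in range(n)]
               + [B[i] + A[i] for i in range(n)])
real_rhs = [v.real for v in w] + [v.imag for v in w]

xy = solve(real_matrix, real_rhs)
z_real = [complex(xy[i], xy[n + i]) for i in range(n)]
z_complex = solve(C, w)
print(z_real, z_complex)
```

The two results agree to roundoff, but the real formulation stores $4n^2$ numbers and eliminates on a matrix of order $2n$.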


Determinants Whereas determinants are usually thought of as an aid in 
solving systems of linear equations via Cramer’s rule, there are other situa- 
tions in which the value of the determinant of a matrix A is required, e.g., in 
some methods for solving eigenvalue and generalized eigenvalue problems. 
As another example, the determinant of the Jacobian of a transformation is 
needed when a change of variable is made in a multiple integral. The most 
efficient way to compute the value of det (A) is via the LU decomposition of 
A. From (9.3-23) we have that 


$A = P^{-1}LU$
from which
$\det(A) = \det(P^{-1}) \det(L) \det(U) = \det(P^{-1}) \prod_{i=1}^{n} u_{ii}$ (9.11-4)
Recalling that
$P = P_{n-1, r_{n-1}} \cdots P_{1, r_1}$ (9.11-5)
and that $\det(P_{k, r_k}) = -1$ if $k \ne r_k$, (9.11-6)
we have that $\det(P) = \det(P^{-1}) = (-1)^N$, where $N$ is the number of row interchanges. Thus
finally,


$\det(A) = (-1)^N \prod_{i=1}^{n} u_{ii}$ (9.11-7)
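A minimal sketch of (9.11-7) follows. The function name `det_lu` is hypothetical; the routine carries out Gaussian elimination with partial pivoting, accumulates the pivots $u_{kk}$, and changes the sign once for each row interchange.

```python
def det_lu(A):
    """Determinant by Gaussian elimination with partial pivoting, as in (9.11-7)."""
    a = [list(map(float, row)) for row in A]
    n = len(a)
    sign = 1.0
    det = 1.0
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            return 0.0          # no nonzero pivot: the matrix is singular
        if p != k:
            a[k], a[p] = a[p], a[k]
            sign = -sign        # each interchange contributes a factor -1
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            for j in range(k + 1, n):
                a[i][j] -= m * a[k][j]
        det *= a[k][k]          # u_kk is final once step k is done
    return sign * det

# Two interchanges are needed here, so the sign factor is (+1).
print(det_lu([[0, 2, 1], [1, 1, 0], [2, 0, 3]]))
```

For matrices of large order the running product may overflow or underflow; accumulating a scaled product or the logarithms of the pivots is a common remedy, which we do not show here.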


Band matrices A band matrix of bandwidth $[w_1, w_2]$ is a matrix for which
$a_{ij} = 0$ if $i - j > w_1$ or $j - i > w_2$. Thus, upper and lower triangular
matrices are band matrices of bandwidth $[0, n - 1]$ and $[n - 1, 0]$, respectively.
Other important examples of band matrices are the tridiagonal matrices
with bandwidth $[1, 1]$ and the upper and lower Hessenberg matrices, which
are defined to be matrices of bandwidths $[1, n - 1]$ and $[n - 1, 1]$, respectively.

The importance of band matrices lies in the fact that in the LU
decomposition (9.3-23) of a band matrix $A$, the bandwidth of $L$ is always $[w_1, 0]$
{64}. Furthermore, in the absence of pivoting (as in the case of symmetric
positive definite matrices or diagonally dominant matrices) the bandwidth
of $U$ is $[0, w_2]$ {64}, so that in this case, if the elements of $L$ and $U$ are
overwritten on those of $A$, no additional storage space is required. And even
in the case where pivoting is required, it is easy to see {64} that the
bandwidth of $U$ is at most $[0, \min(w_1 + w_2, n - 1)]$. Thus, for upper Hessenberg
matrices, the same situation holds as above, while in the general case,
additional storage is required but usually much less than for a full matrix. (The
lower Hessenberg matrices are an exception to this rule; however, see {69}.)
In addition, the computation time is much smaller for certain band matrices.
In the important case of upper Hessenberg matrices, we need $O(n^2)$
multiplications for LU decomposition while for general band matrices for which
$w_1 + w_2 \ll n$ we need only $O(n)$ multiplications. A similar reduction in the
number of operations also occurs in the forward and back substitutions {67}. 
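The definition of bandwidth translates directly into code. The following sketch (the function name `bandwidth` is ours) returns the smallest pair $[w_1, w_2]$ satisfying the definition for a matrix stored as a list of rows, and confirms the bandwidths quoted above for three of the special forms.

```python
def bandwidth(A):
    """Return [w1, w2]: the largest i - j and j - i over the nonzero a_ij."""
    w1 = w2 = 0
    for i, row in enumerate(A):
        for j, value in enumerate(row):
            if value != 0:
                w1 = max(w1, i - j)
                w2 = max(w2, j - i)
    return [w1, w2]

n = 5
tridiagonal = [[1 if abs(i - j) <= 1 else 0 for j in range(n)] for i in range(n)]
upper_hessenberg = [[1 if i - j <= 1 else 0 for j in range(n)] for i in range(n)]
lower_triangular = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]
print(bandwidth(tridiagonal), bandwidth(upper_hessenberg), bandwidth(lower_triangular))
```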

Of particular interest is the triangular decomposition of a tridiagonal
matrix without row interchanges, which arises in many situations in numerical
analysis, e.g., in the computation of cubic splines (cf. Sec. 3.8). The
algorithm for the complete solution of a tridiagonal system is very simple and
elegant in this case and proceeds as follows.

The matrix $A$ can be represented as three vectors $a$, $b$, and $c$ consisting of
the subdiagonal, diagonal, and superdiagonal elements, respectively. [For
uniformity of notation $a = (0, a_2, \ldots, a_n)^T$ and $c = (c_1, \ldots, c_{n-1}, 0)^T$.]

In the LU decomposition of $A$, $L$ has bandwidth $[1, 0]$ and $U$, $[0, 1]$.
Furthermore it turns out, as can be verified by direct multiplication, that the
superdiagonal of $U$ is equal to that of $A$. It thus remains to compute $\alpha$, the
subdiagonal of $L$, and $\beta$, the diagonal of $U$. The formulas for this are given by
{65}


$$\beta_1 = b_1 \qquad \alpha_k = \frac{a_k}{\beta_{k-1}} \qquad \beta_k = b_k - \alpha_k c_{k-1} \qquad k = 2, 3, \ldots, n$$ (9.11-8)


Once we have the triangular decomposition, we can solve the system Ax = d 
by forward and back substitution as follows: 


$$\delta_1 = d_1 \qquad \delta_k = d_k - \alpha_k \delta_{k-1} \qquad k = 2, 3, \ldots, n$$ (9.11-9)

$$x_n = \frac{\delta_n}{\beta_n} \qquad x_k = \frac{\delta_k - c_k x_{k+1}}{\beta_k} \qquad k = n - 1, \ldots, 1$$ (9.11-10)

Throughout this process, the vectors $\alpha$, $\beta$, $\delta$ can overwrite $a$, $b$, $d$ and the
solution $x$ can in turn overwrite $\delta$. The total number of operations is seen to
be $3n - 3$ additions, $3n - 3$ multiplications, and $2n - 1$ divisions, which is a
very satisfactory result.

When pivoting is required, the algorithm is not as elegant and requires
an additional vector for storage. The number of operations also increases to
at most $4n - 5$ additions, $5n - 6$ multiplications, and $2n - 1$ divisions, which
is still quite satisfactory {66}.
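Formulas (9.11-8) to (9.11-10) can be sketched as follows. The function name `solve_tridiagonal` is hypothetical, separate work vectors are kept for clarity rather than overwriting $a$, $b$, $d$, and no pivoting is done, so that every $\beta_k$ is assumed to be nonzero. The data are those of Example 9.7, but the computation is carried out in double precision rather than in 5-digit decimal arithmetic.

```python
def solve_tridiagonal(a, b, c, d):
    """Solve a tridiagonal system by (9.11-8) to (9.11-10), without pivoting.

    a, b, c hold the subdiagonal, diagonal, and superdiagonal, padded so that
    a[0] = 0 and c[-1] = 0; d is the right-hand side. Python lists are
    0-based, so index k here corresponds to k + 1 in the text.
    """
    n = len(b)
    alpha, beta, delta = [0.0] * n, [0.0] * n, [0.0] * n
    beta[0], delta[0] = b[0], d[0]
    for k in range(1, n):
        alpha[k] = a[k] / beta[k - 1]                  # (9.11-8)
        beta[k] = b[k] - alpha[k] * c[k - 1]
        delta[k] = d[k] - alpha[k] * delta[k - 1]      # (9.11-9)
    x = [0.0] * n
    x[-1] = delta[-1] / beta[-1]                       # (9.11-10)
    for k in range(n - 2, -1, -1):
        x[k] = (delta[k] - c[k] * x[k + 1]) / beta[k]
    return x

# The system of Example 9.7, solved in double precision.
a = [0.0, 90.860, -67.590, 46.260]
b = [136.01, 98.810, 132.01, 177.17]
c = [90.860, -67.590, 46.260, 0.0]
d = [-33.254, 49.790, 28.067, -7.3244]
print(solve_tridiagonal(a, b, c, d))
```

The factorization and the forward substitution share one loop here, which changes neither the operation count nor the results.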


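Example 9.7 Solve the following tridiagonal system using 5-digit floating-point decimal
arithmetic.

$$\begin{aligned} 136.01x_1 + 90.860x_2 &= -33.254 \\ 90.860x_1 + 98.810x_2 - 67.590x_3 &= 49.790 \\ -67.590x_2 + 132.01x_3 + 46.260x_4 &= 28.067 \\ 46.260x_3 + 177.17x_4 &= -7.3244 \end{aligned}$$

We have that

$$\begin{aligned} a &= [0 \quad 90.860 \quad -67.590 \quad 46.260]^T \\ b &= [136.01 \quad 98.810 \quad 132.01 \quad 177.17]^T \\ c &= [90.860 \quad -67.590 \quad 46.260 \quad 0]^T \\ d &= [-33.254 \quad 49.790 \quad 28.067 \quad -7.3244]^T \end{aligned}$$

Using (9.11-8), we compute $\alpha$ and $\beta$ as follows:

$$\begin{aligned} \beta_1 &= 136.01 & \alpha_2 &= \frac{90.860}{\beta_1} = .66804 & \beta_2 &= 98.810 - 90.860\alpha_2 = 38.112 \\ & & \alpha_3 &= -\frac{67.590}{\beta_2} = -1.7735 & \beta_3 &= 132.01 + 67.590\alpha_3 = 12.139 \\ & & \alpha_4 &= \frac{46.260}{\beta_3} = 3.8109 & \beta_4 &= 177.17 - 46.260\alpha_4 = .87777 \end{aligned}$$

We then compute $\delta$ as follows, using (9.11-9),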
$$\begin{aligned} \delta_1 &= -33.254 & \delta_2 &= 49.790 - \alpha_2\delta_1 = 72.005 \\ \delta_3 &= 28.067 - \alpha_3\delta_2 = 155.77 & \delta_4 &= -7.3244 - \alpha_4\delta_3 = -600.95 \end{aligned}$$

Finally, we compute the solution $x$, using (9.11-10),

$$\begin{aligned} x_4 &= \frac{\delta_4}{\beta_4} = -684.63 & x_3 &= \frac{\delta_3 - 46.260x_4}{\beta_3} = 2621.9 \\ x_2 &= \frac{\delta_2 - (-67.590)x_3}{\beta_2} = 4651.7 & x_1 &= \frac{\delta_1 - 90.860x_2}{\beta_1} = -3105.7 \end{aligned}$$

In Example 9.3, we computed the LU decomposition of the matrix of this system. If we
now solve the system

$$LUx = Pd$$

by forward and back substitution, using the matrices $L$, $U$, and $P$ of Example 9.3, we find
that

$$\hat{x} = [-2953.3 \quad 4420.5 \quad 2491.5 \quad -650.59]^T.$$

This solution differs considerably from that computed above. However, if we solve the
tridiagonal system using 10-digit arithmetic, the solution $\tilde{x}$ rounded to 5 digits is

$$\tilde{x} = [-2957.4 \quad 4426.6 \quad 2495.0 \quad -651.49]^T$$

which is much closer to $\hat{x}$ than to $x$.
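The reason for this discrepancy is that the matrix $A$ is ill-conditioned. If the inverse
matrix $A^{-1}$ is computed, we have that

$$A^{-1} = \begin{bmatrix} 22.229 & -33.264 & -18.746 & 4.8948 \\ -33.264 & 49.793 & 28.062 & -7.3271 \\ -18.746 & 28.062 & 15.823 & -4.1315 \\ 4.8948 & -7.3271 & -4.1315 & 1.0844 \end{bmatrix}$$

so that $\|A^{-1}\|_\infty \approx 118$. Since $\|A\|_\infty \approx 257$, we have that $K_\infty(A) > 30{,}000$. Thus if we
assume that $x$ is the exact solution of $(A + \delta A)x = d$, where $\|\delta A\|_\infty \approx \frac{1}{2} \times 10^{-5}\|A\|_\infty$,
which is one unit of roundoff error, we have from (9.4-23) that

$$\frac{\|x_t - x\|_\infty}{\|x_t\|_\infty} \le \frac{\|A^{-1}\|_\infty \|\delta A\|_\infty}{1 - \|A^{-1}\|_\infty \|\delta A\|_\infty} \approx .18$$

and indeed

$$\frac{\|\tilde{x} - x\|_\infty}{\|\tilde{x}\|_\infty} = \frac{225.1}{4426.6} \approx .051$$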
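The norms quoted in this example can be checked directly from the printed entries. In the sketch below the helper `inf_norm` is our own; the matrix $A$ is assembled from the vectors $a$, $b$, $c$ of the example, and the two solution vectors are copied from the text.

```python
def inf_norm(M):
    """Maximum absolute row sum, the matrix norm subordinate to the max norm."""
    return max(sum(abs(v) for v in row) for row in M)

A = [[136.01, 90.860, 0.0, 0.0],
     [90.860, 98.810, -67.590, 0.0],
     [0.0, -67.590, 132.01, 46.260],
     [0.0, 0.0, 46.260, 177.17]]
A_inv = [[22.229, -33.264, -18.746, 4.8948],
         [-33.264, 49.793, 28.062, -7.3271],
         [-18.746, 28.062, 15.823, -4.1315],
         [4.8948, -7.3271, -4.1315, 1.0844]]

condition = inf_norm(A) * inf_norm(A_inv)
x_5digit = [-3105.7, 4651.7, 2621.9, -684.63]     # 5-digit tridiagonal result
x_10digit = [-2957.4, 4426.6, 2495.0, -651.49]    # rounded 10-digit solution
error = max(abs(p - q) for p, q in zip(x_10digit, x_5digit))
relative_error = error / max(abs(v) for v in x_10digit)
print(inf_norm(A), inf_norm(A_inv), condition, relative_error)
```

In both matrices the largest row sum comes from the second row, and the largest component of the error is the second one.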


BIBLIOGRAPHIC NOTES 


The literature on the solution of simultaneous linear equations is vast. A classification of this 
literature together with an exhaustive bibliography up to 1952 is given in Forsythe (1953b).
More recent extensive bibliographies can be found in Faddeev and Faddeeva (1963) and 
Householder (1964); see also Westlake (1968) and Stewart (1973). Bibliographies on iterative 
methods appear in Varga (1962) and Young (1971). 

For the material of this chapter on direct methods, we have drawn heavily on the work of 
Wilkinson (1960, 1961, 1964, 1967). For the use of iterative techniques in the solution of partial 
differential equations, a good source is Varga (1962). Other good general sources for the 
material of this chapter are Bodewig (1959), Faddeeva (1959), and Fox (1965). Householder 
(1964) presents an extensive survey of the theoretical aspects of the solution of linear systems. 

A comprehensive survey of the various methods for solving systems of linear equations is
given in Westlake (1968). An interesting attempt to combine the theory, applications, and
numerical aspects of linear algebra in one framework appears in Noble (1969). A collection of 
algorithms implementing many of the methods treated in this chapter appears in Wilkinson and 
Reinsch (1971). 


Section 9.1 The material in this section is classical; see, for example, Birkhoff and 
MacLane (1953) or Faddeev and Faddeeva (1963). 


Section 9.2 Discussion of some of the matters considered here can be found in Bodewig 
(1959), Newman (1962), Forsythe (1953a), and also in Modern Computing Methods (1961). 


Sections 9.3 to 9.5 The material in these sections is based principally on Wilkinson (1967), 
which is the source of Example 9.0. Other good references are Forsythe and Moler (1967) and 
Stewart (1973). 


Section 9.3 Gaussian elimination and the Gauss-Jordan reduction are discussed in any 
source that considers direct methods; see, for example, Bodewig (1959) or Hildebrand (1974). 
The Doolittle algorithm is discussed in Modern Computing Methods (1961) and by Bodewig 
(1959), that of Crout (1941) in Hildebrand (1974), and that of Cholesky in Schwarz et al. (1973). 
Bodewig (1959) discusses the number of operations required for various methods. 


Section 9.4 Fraenkel and Lowenthal (1971) discuss a method for finding the exact solution 
of a system of linear equations with rational coefficients and give a Fortran program which 
implements this method. 

The approach to error analysis based on bounding the roundoff accumulation from stage 
to stage is exhaustively discussed in the now classical papers of Von Neumann and Goldstine 
(1947) and Goldstine and Von Neumann (1951). Our approach here is due to Wilkinson (1964), 
which is a mine of information, and Wilkinson (1960). Wilkinson (1961) discusses the errors in a 
number of direct methods in terms of both fixed- and floating-point arithmetic. Bodewig (1959) 
discusses errors at some length. 


Section 9.5 Our approach to finding improved solutions of ill-conditioned equations will 
be found in Modern Computing Methods (1961); Bodewig (1959) also discusses ill-conditioned 
equations. The pitfalls of iterative refinement are vividly illustrated by Kahan (1966). 


Section 9.6 The most complete coverage of iterative methods is that in Young (1971), 
which also contains an extensive bibliography. Iterative methods are discussed by many other 
authors; see Varga (1962), Bodewig (1959), Faddeeva (1959), Newman (1962), and Sheldon 
(1960). The paper by Martin and Tee (1961) contains a useful survey of iterative methods. 


Section 9.7 The Jacobi and Gauss-Seidel iterations have been widely analyzed. Varga 
(1962) considers both; the Gauss-Seidel method is discussed by Van Norton (1960). An excel- 
lent source on the acceleration of stationary iterative processes and their use in partial differen- 
tial equations is Varga (1962); see also Sheldon (1960). 


Section 9.8 Thorough discussions of matrix inversion are given by Bodewig (1959). For 
other techniques than we have presented here, see Wilf (1960), Householder (1953), and West- 
lake (1968). 


Section 9.9 An extensive treatment of the problem of solving overdetermined systems of 
linear equations in the least-squares sense is given in Lawson and Hanson (1974), which also 
contains a collection of Fortran programs implementing the theory; see also Stewart (1973) and
the contribution by Businger and Golub in Wilkinson and Reinsch (1971). Another approach to
the problem via the singular-value decomposition of a matrix is also discussed in Lawson and 
Hanson and in the contribution by Golub and Reinsch in Wilkinson and Reinsch (1971). 

The application of linear programming to the solution of overdetermined systems of linear 
equations in the L, and L,, norms is discussed by Rabinowitz (1968), who describes many 
applications of linear programming in numerical analysis. 


Section 9.10 The simplex method was developed by Dantzig and his coworkers in the late 
forties and is treated in great detail in Dantzig (1963). Our treatment is based on that of 
Luenberger (1973), which is also the source of Example 9.6. An algorithm implementing a 
more stable variation of the simplex method is given in the contribution of Bartels, Stoer, and 
Zenger in Wilkinson and Reinsch (1971). 


Section 9.11 Algorithms for the solution of complex systems of linear equations using the 
Crout factorization are given in the contribution by Bowdler, Martin, Peters, and Wilkinson in 
Wilkinson and Reinsch (1971). The accurate evaluation of determinants is discussed in Forsythe 
and Moler (1967). Algorithms for the solution of symmetric and unsymmetric band equations 
are given in two contributions by Martin and Wilkinson in Wilkinson and Reinsch (1971). 
The algorithm for the solution of a tridiagonal system appears in Dahlquist and Björck (1974)
and in many other references.


PROBLEMS


Section 9.1


1 Use Gaussian elimination to prove Theorem 9.1 and its corollary. 


Section 9.2 


2 (a) Is the matrix of the coefficients in the normal equations of polynomial least-squares 
approximations usually filled or sparse? Explain your answer. 

(b) By considering simple difference approximations to derivatives, explain why the
numerical solution of partial differential equations often results in systems of equations with 
sparse coefficient matrices. 


3 The Fibonacci sequence is generated using the difference equation

$$f_j = f_{j-1} + f_{j-2} \qquad j > 1 \qquad f_0 = 0 \qquad f_1 = 1$$

(a) Show that $f_n f_{n+2} - f_{n+1}^2 = (-1)^{n+1}$, $n = 0, 1, \ldots$
(b) Thus find the unique solution of

$$f_n x_1 + f_{n+1} x_2 = f_{n+2} \qquad f_{n+1} x_1 + f_{n+2} x_2 = f_{n+3}$$

(c) How do you know that the system in part (b) becomes increasingly ill-conditioned as $n$
increases?

(d) In particular let $n = 10$ in part (b) and replace $f_{n+2}$ in the second equation by $f_{n+2} + \epsilon$.
Calculate the solution for $\epsilon = .018$ and $\epsilon = .02$. For what value of $\epsilon$ does the solution not exist?
[Ref.: Miles (1960).]


Section 9.3


4 Verify Eqs. (9.3-13) and (9.3-14). 

5 Let $A$ be an $m \times n$ matrix.

(a) Show that if $P_{ij}$ is an $m \times m$ matrix defined by (9.3-15), then $P_{ij}A$ is the matrix $A$ with
rows $i$ and $j$ interchanged.

(b) Show that if $P_{ij}$ is an $n \times n$ matrix defined by (9.3-15), then $AP_{ij}$ is the matrix $A$ with
columns $i$ and $j$ interchanged.

(c) Use part (a) to show that $P_{ij}^2 = I$.

6 Show that if L, is defined as in (9.3-16), then with L;' = E, we have |,, = 6,;, j # i, 
Li=O.k<ih;= my, k >i. 

7 (a) Prove that the product of two (unit) triangular matrices is a (unit) triangular matrix 
of the same form. 

(b) Show that if L is a unit lower triangular matrix such that [,, = 0 for j < i,j # k, and L, 
iS defined as in (9.3-16), then L,;L= L where hj = lejoJ # l, he = leis k < i, Li = —-M;, k > i. 

(c) Hence verify (9.3-22). 

8 (a) Show that if r, =k >i and L; is defined as in (9.3-16), then for any matrix B, 
P,,,, L;B = (Px, +, LiPs,,,)Px.-,8 where P,,,,L;P,,,, has the same form as L;. 

(b) Hence, show directly that (9.3-21) follows from (9.3-19). 

9 (a) Use (9.3-7) to derive

$$a_{ij}^{(k-1)} = a_{ij}^{(0)} - m_{i1} a_{1j}^{(0)} - m_{i2} a_{2j}^{(1)} - \cdots - m_{i,k-1} a_{k-1,j}^{(k-2)} \qquad i, j \ge k \qquad a_{ij}^{(0)} = a_{ij}$$

(b) By making the identifications $m_{ir} = l_{ir}$, $i > r$, and $a_{rj}^{(r-1)} = u_{rj}$, $j \ge r$, show that the
equations in part (a) for $i, j \ge k = r$ are identical with (9.3-34) and (9.3-35).

(c) Hence, conclude that if (9.3-34) and (9.3-35) are computed using single-precision 
arithmetic, the results are identical with those arising from Gaussian elimination. 

10 Show that if we define |, =u,,l,, i=, ..., m (recall that /,, = 1), and i,, = u,,/u,,, 
j=r,...,n, then (9.3-34) and (9.3-35) go over to (9.3-38) and (9.3-37), respectively, and hence 
that], =u,,,r=1,...,n. 

11 (a) Verify (9.3-43).

(b) Compute the exact number of multiplications and divisions in the Cholesky factoriza- 
tion of a positive definite matrix. 

(c) Repeat part (b) for the factorization (9.3-47). 

12 (a) Use (9.3-47) to show that (9.3-48) and (9.3-49) follow from (9.3-45) and (9.3-46), 
respectively. 

(b) Assurne that we have a decomposition of A,_, in the form L,_,D,-,17_,, where 
L,_ , is unit lower triangular and D, _ , is a diagonal matrix with positive elements. Show that a 
decomposition of A, exists of the form 

_ppor [Er 0} p_ [B,1 0 
A, =L,D,0 where L, = le 1 D= r a d,>0 
and exhibit the values of €,_, and d,. 

13 (a) Show that if we apply Gaussian elimination without pivoting to a symmetric
matrix $A$, then $m_{i1} = a_{1i}/a_{11}$.

(b) Deduce from this that $A^{(1)}$ is symmetric and consequently that with $m_{ik} = a_{ki}^{(k-1)}/a_{kk}^{(k-1)}$
all $A^{(k)}$ are symmetric.

(c) Show that the computation required is almost halved compared with the nonsymmetric
case.

(d) Use this simplification to solve the system

$$\begin{aligned} .6428x_1 + .3475x_2 - .8468x_3 &= .4127 \\ .3475x_1 + 1.8423x_2 + .4759x_3 &= 1.7321 \\ -.8468x_1 + .4759x_2 + 1.2147x_3 &= -.8621 \end{aligned}$$


14 Let $A$ be a symmetric positive definite matrix of order $n$ and $M$ a nonvoid subset of
$N = \{1, 2, \ldots, n\}$. By considering the set of vectors $x = (x_1, x_2, \ldots, x_n)^T$ such that $x_i = 0$,
$i \in N - M$, show that the submatrix $A_M = (a_{ij} \mid i, j \in M)$ is positive definite; i.e., show that every
principal minor of $A$ is positive definite.


15 Consider the system 
4096x, + .1234x, + .3678x3 + .2943x, 


2246x, + .3872x, + .4015x, + .1129x, = .1550 


: 


.3645x, + .1920x. + .3781x3 + .0643x,4 = .4240 
.1784x, + .4002x, + .2786x, + .3927x, = —.2557 


(a) Solve this system without pivoting by Gaussian elimination, carrying four decimal 


places throughout. 
(b) Solve the system of part (a) using partial pivoting and compare the results with those 


of part (a). 
*16 Let A, be the matrix of order n 


(—1)'ts-! i>j or j=n 


a; = 1 izn ay, = i<j<n 




(a) Show that the elements of the inverse of this matrix are given by 


a,;'=27! ixn a,j) =(—1pr*F 42-4 jen 
an} = 0 i>j;ign apt =(-1""ti-1Q-"+§ jan 
ij 

=(-1)*F12-00 FD i<j;j#n a7} = —27"*! 


(b) Write out A, and A;'. 

(c) Why is A, very well-conditioned? 

(d) Consider A3, and let B,, be the matrix formed by replacing a3, 3, by —4. Show that 
B;,' = Aj; — 4 (last column of A;,') (last row of Aj,')/(1 — 273"), and thus deduce that some 
of the elements of A;,' differ from those of B;;' in the first decimal. 

(e) Show that, if partial pivoting is used, Gaussian elimination applied to A,, and B,, is 
identical until the final step; i.e, show that the thirty-first row never has the element with 
greatest magnitude in a column. 

(f) Show that the final element in the triangle should be —2°° for A,, and —2°° +4 for 
B,,. Thus deduce that if fewer than 30 binary digits are used in the computation, Gaussian 
elimination applied to A,, and B;, leads to identical L and U matrices, and thus to identical 
inverses [or solutions of (9.1-1)] in spite of the fact that the true inverses differ substantially. 

(g) Why would complete pivoting avoid this difficulty? [Ref.: Wilkinson (1961), 
pp. 327-328.] 


*17 (a) Let A in (9.1-1) be symmetric and positive definite and let B be the matrix of 
coefficients in (9.3-2) excluding the first row. Show that 


n n n a; 2 n n a) 
aii — Ai [ Xt 2 xP => daly xx, 


i=1 j=1 i=2411 i=2 j=2 


(b) Deduce from this that B is positive definite and therefore that the matrices of all the 
derived systems are positive definite. 
(c) Further, show that 


ai <a, i=2,...,n 
and thus deduce that 


max |al?|< max |a,,| 
2si,jsn 2si,jsn 


(d) From parts (b) and (c) deduce that if |a,;| <1, then |af}| <1 for all k. 
(e) By giving a 2 x 2 example, show that with no pivoting some of the multipliers m,, may 
be very large. [Ref.: Wilkinson (1961), pp. 285-286.] 


18 (a) Solve the system

$$\begin{aligned} .2641x_1 + .1735x_2 + .8642x_3 &= -.7521 \\ .9411x_1 - .0175x_2 + .1463x_3 &= .6310 \\ -.8641x_1 - .4243x_2 + .0711x_3 &= .2501 \end{aligned}$$

using Gaussian elimination with (i) no pivoting, (ii) partial pivoting, (iii) complete pivoting.
(b) Repeat part (a) for the symmetric system

$$\begin{aligned} .4721x_1 + .2352x_2 - .2613x_3 + .8421x_4 &= -.2317 \\ .2352x_1 + .7411x_2 - .0463x_3 + .1569x_4 &= .3219 \\ -.2613x_1 - .0463x_2 + .8955x_3 + .1748x_4 &= .6217 \\ .8421x_1 + .1569x_2 + .1748x_3 + .9841x_4 &= .9835 \end{aligned}$$

19 (a) Consider the equilibrated system Ax = b, where


e —1 1 
A= - 1 | le] <1 
1 | 1 
0 -2 2 
Verify that Atl= -2 l-—e Ll+e 
2 ite l-e 


(b) Use (1.3-16) to show that K,,(A) = 12 so that A is a well-conditioned matrix [See 
page 431 for the definition of K(A)]. 

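(c) Make the substitutions $x_2' = x_2/\epsilon$, $x_3' = x_3/\epsilon$ into the system $Ax = b$ and equilibrate
the resulting system to yield $A'x' = b'$. Show that

$$A' = \begin{bmatrix} 1 & -1 & 1 \\ -1 & \epsilon & \epsilon \\ 1 & \epsilon & \epsilon \end{bmatrix}$$

(d) Verify that

$$A'^{-1} = \begin{bmatrix} 0 & -\frac{1}{2} & \frac{1}{2} \\ -\frac{1}{2} & \frac{1-\epsilon}{4\epsilon} & \frac{1+\epsilon}{4\epsilon} \\ \frac{1}{2} & \frac{1+\epsilon}{4\epsilon} & \frac{1-\epsilon}{4\epsilon} \end{bmatrix}$$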

(e) Use (1.3-16) to show that K’,(A’) = 3 + 3/2e, so that A’ is an ill-conditioned matrix 
even though 4’ is equilibrated. 

(f) Show that the result of solving A’x’ = b’ with partial pivoting or (if € < 0) with complete 
pivoting is equivalent to solving Ax = b without pivoting. 

(g) For « = —10°° and b = (1, 1, 3)" and using 4-digit floating point decimal arithmetic 
with rounding, solve the system Ax = b without pivoting. 

(h) Repeat part (g) with partial pivoting. 

(i) Repeat part (g) for the system A’x’ = b’ and compare the result transformed by the 
equations x, = x}, xX. = €x,, x3 = €x5 with the results of parts (g) and (h). [Ref.: Dahlquist 
and Bjérck (1974), p. 182, and Fox (1965), p. 169.] 

20 (a) Give formulas corresponding to (9.3-28) and (9.3-8) to solve the system $LUx = b$,
where $L$ is a nonsingular lower triangular matrix and $U$ is unit upper triangular (the Crout
case).

(b) Repeat part (a) for the system LL’x = b (the Cholesky case). 

(c) Repeat part (a) for the system LDUx = b, where L and U are unit triangular and D is 
diagonal and nonsingular. 

(d) Specialize part (c) to the system LDL’ x = b arising from (9.3-47). 



Section 9.4 


21 (a) Show that for any matrix norm $\|I\| \ge 1$, where $I$ is the identity matrix.
(b) Show that for any matrix norm $\|A\| \cdot \|A^{-1}\| \ge 1$.


22 Verify (9.4-10). 
23 Diagonally dominant matrices. (a) If A is diagonally dominant, that is,
$\sum_{j=1,\, j \ne i}^{n} |a_{ij}| < |a_{ii}|$, i = 1, ..., n, show that $a_{ii} \ne 0$, i = 1, ..., n.

(b) Let A = D~'B, where D = diag (a,;). Show that B = I — C, where ||C||, < 1, so that 
by Corollary 9.3, B and hence A are nonsingular. 

(c) Show that if we apply Gaussian elimination to a diagonally dominant matrix A, all the 
elements al” ') + 0, so that in theory, pivoting is never necessary. 


24 Let A be a matrix such that $A^T$ is diagonally dominant, i.e., such that $\sum_{i=1,\, i \ne j}^{n} |a_{ij}| <
|a_{jj}|$, j = 1, ..., n, and apply Gaussian elimination without pivoting.

(a) Show that $\sum_{i=2}^{n} |m_{i1}| < 1$.

(b) Show that '7_2 |alp| < Y7., lal?|, aff =aj,j=2,....n. 

(c) Show that jai?) > tu. i+, lal |, j= 2, .... a, so that A is also diagonally 
dominant. 

(d) Repeat the arguments above to show that }7_,,, |a{?| < )7., jaf |,r=2,..., 
n— 1, and hence that |af| < )7., [aff yn. 

(e) Hence conclude that if A is column diagonally dominant, i.e., if $A^T$ is (row) diagonally
dominant, and we apply Gaussian elimination without pivoting, the growth factor g < 2.
[Ref.: Wilkinson (1961).] 

25 (a) Verify (9.4-40) and (9.4-41). 

(b) Derive from (9.3-8), a formula similar to (9.4-43) and use it to verify (9.4-46) and 
(9.4-47). 

(c) Derive formulas similar to (9.4-43) and that of part (b) for the case of the forward and 
back substitution formulas connected with the Crout algorithm given in Prob. 20. 

(d) Derive formulas analogous to (9.4-44) to (9.4-51) for the Crout case. 

(e) Verify (9.4-53). 

(f) Verify (9.4-54). 


Section 9.5 


26 Verify (9.5-4). 
27 (a) Using 4-digit floating-point decimal arithmetic with chopping (not rounding), 
obtain a solution correct to three significant figures to the system 


$$.8647x_1 + .5766x_2 = .2885$$

$$.4322x_1 + .2882x_2 = .1442$$


by using iterative refinement. 

(b) Repeat part (a) using 9-digit arithmetic, either chopped or rounded. 

(c) Compute the exact residual vectors for the solutions obtained in parts (a) and (b) and 
thus verify that the solution obtained in part (b) is the exact solution. 

(d) Account for the failure of the iterative refinement process in part (a). [Ref.: Kahan
(1966).]


28 Consider the system

$$\begin{aligned}
.05x_1 + .07x_2 + .06x_3 + .05x_4 &= .23 \\
.07x_1 + .10x_2 + .08x_3 + .07x_4 &= .32 \\
.06x_1 + .08x_2 + .10x_3 + .09x_4 &= .33 \\
.05x_1 + .07x_2 + .09x_3 + .10x_4 &= .31
\end{aligned}$$


(a) What is the true solution of this system? 
(b) Solve this system using the Doolittle method with partial pivoting. 
(c) Repeat part (b) using the Crout method with partial pivoting.

(d) Repeat part (b) using the $LDL^T$ decomposition.

(e) Apply iterative refinement for each of the above methods until the solution is correct 
to four significant figures. 


Section 9.6 


29 (a) Show that the matrix iteration $B_{i+1} = B_i(2I - AB_i)$, $B_1$ arbitrary, to find $A^{-1}$ is
the matrix analog of the Newton-Raphson method for finding the reciprocal of a number.

(b) Defining $C_i = I - AB_i$, show that $C_i = C_1^{2^{i-1}}$.

(c) Thus deduce that a sufficient condition for the convergence of the iteration of part (a)
to $A^{-1}$ is that all the eigenvalues of $C_1$ lie within the unit circle. (This condition is also
necessary.)

(d) How could this iteration be used to solve systems of linear equations? Do two itera- 
tions on the system of Example 9.4 with B, = 4/. What is bad about this method from the 
computational point of view? [Ref.: Newman (1962), pp. 223-226.] 


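30 (a) Let $x_1$ be a solution of Bx = 0, where det (B) = 0, and let $x_1^{(i)}$ be the component of
$x_1$ of largest magnitude. Show that

$$|b_{ii}| \le \sum_{\substack{k=1 \\ k \ne i}}^{n} |b_{ik}|$$

(b) Use this result to prove that

$$|\lambda_{\max}| \le \max_i \sum_{k=1}^{n} |a_{ik}| \qquad |\lambda_{\min}| \ge \min_i \Bigl( |a_{ii}| - \sum_{\substack{k=1 \\ k \ne i}}^{n} |a_{ik}| \Bigr)$$

where $\lambda_{\max}$ and $\lambda_{\min}$ are the eigenvalues of $A = [a_{ij}]$ of maximum and minimum magnitude,
respectively.

(c) Thus deduce that the Jacobi iteration for a matrix A with diagonal terms equal to 1
converges if

$$\sum_{k \ne i} |a_{ik}| < 1 \quad i = 1, \ldots, n \qquad \text{or if} \qquad \sum_{k \ne i} |a_{ki}| < 1 \quad i = 1, \ldots, n$$
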
Section 9.7 


31 Use (9.7-9) to derive Eqs. (9.7-10) and (9.7-11). 
*32 In the Gauss-Seidel method let x‘), be the kth component of x,,, and let 


k-1 n 


k) ) j) 
rH), = b, — Yay xP? — > aX? 
j=l jak 
(a) Show that 
ri), 
ky) __ Uk i 
xf), = xf) + — 
Any 
(b) If €; = x, — x,;, show that 
re k-1 n 
k) _ kk i+ 3 j) d) 
ef), =f —-—— and Mi = ) aj, + >, a, ;€ 
Oy j=l j=k 




(c) Let A be symmetric and consider the quadratic form Q(e,) = €7 Ae,. Show that 


Q(€;+1) — Q(e;) = -y (ria) 


(d) From this deduce that if A is a nonsingular symmetric matrix with positive diagonal
elements, and if the Gauss-Seidel method converges for any starting vector, then A must be positive definite.
[Ref.: Van Norton (1960).]


33 (a) Let L and U be, respectively, lower and upper triangular nonnegative matrices with
zeros on the main diagonal. Prove that $(I - L)^{-1}U$ is nonnegative.

(b) Thus deduce that if A is a matrix with 1s on the main diagonal and nonpositive entries 
otherwise, both B, and B,, the Jacobi and Gauss-Seidel iteration matrices, are nonnegative. 
[Ref.: Varga (1962), p. 69.] 

34 (a) By calculating the spectral radius of the B matrices for the Jacobi and Gauss-Seidel 
iterations in Example 9.4, show that the Gauss-Seidel method would be expected to converge 
substantially faster than the Jacobi method. 

(b) Apply the $\delta^2$ process to the results of the Jacobi iteration in Example 9.4 with i = 2.
How do you explain this result?

(c) Apply the $\delta^2$ process to the Gauss-Seidel results with i = 3. How do you explain this
result?


35 (a) Determine the spectral radius of the B matrix for the Gauss-Seidel iteration for the
system

$$\begin{aligned}
.96326x^{(1)} + .81321x^{(2)} &= .88824 \\
.81321x^{(1)} + .68654x^{(2)} &= .74988
\end{aligned}$$


(b) Compute the exact solution of this system to five decimal places using Cramer’s rule. 

(c) Attempt to use the Gauss-Seidel method to find a solution of this system starting with
$x^{(1)} = .33116$, $x^{(2)} = .70000$. How do you explain your result? [Ref.: Wilkinson (1961),
pp. 328-329.]


36 Use (9.7-30) to derive (9.7-31) and (9.7-32). 
*37 Consider the nonstationary iteration 


$$x_{i+1} = [(1 + \omega_i)B - \omega_i I]x_i + (1 + \omega_i)Cb$$


where B and C satisfy the consistency condition (9.6-11). 
(a) Show that this iteration also satisfies the consistency condition and that K; defined by 
(9.6- 16) is 


K,(B) = Tl [(1 + @,)B - o,1] 
(b) By letting w,; = y;/(1 — y;) show that 


i 
—y,l 
K(B) = 115 


— Vj 


(c) If B is symmetric, deduce that p[K,(B)] = max, | K,(u,)|, where u,, j = 1,..., n, are the 
eigenvalues of B. 

(d) Suppose we know that $-1 < x_0 \le \mu_j \le x_1 < 1$ for all $\mu_j$. Show then that in order to
minimize the magnitude of $\rho[K_i(B)]$ we should like to have

$$K_i(x) = \frac{T_i(ax + b)}{T_i(a + b)}$$

where $T_i(x)$ is the Chebyshev polynomial of degree i and

$$a = \frac{2}{x_1 - x_0} \qquad b = -\frac{x_1 + x_0}{x_1 - x_0}$$

(e) Thus deduce that the $\gamma_j$'s should be chosen as the zeros of $T_i(ax + b)$ with the $\omega_j$'s then
given by part (b).

(f) Suppose these $\omega_j$'s have been used to compute $x_{i+1}$. Besides requiring some knowledge
of the eigenvalues of B, what disadvantage does this technique have if, having computed
$x_{i+1}$, we now desire $x_{i+2}$? [Ref.: Sheldon (1960), pp. 146-147.]


38 Consider the system 


x) — 4,0) 4,4 24 
x) — 4403) dy) 2 

4x) 1,2) 4 x3) =41 
4,0) _ 4,02) + x od 


(a) Starting with x, = 0, do four iterations of the Jacobi method. 

(b) Using the same starting vector do four iterations of the Gauss-Seidel method. 

(c) What is the true solution of the system? 

39 (a) Calculate the B matrix for the Jacobi method for the system of the previous 
problem and find its spectral radius. 

(b) Similarly, find the spectral radius of B for the Gauss-Seidel iteration. 

(c) Are the results of the previous problem in accord with these results? 


Section 9.8 


40 Prove that the inverse of a triangular matrix is a triangular matrix of the same form. 


41 (a) Derive the algorithm for inverting the upper triangular matrix U which is analo- 
gous to (9.8-3). 

(b) Calculate that the inversion of L and U and the computation of $U^{-1}L^{-1}$ requires
$\frac{2}{3}n^3 + O(n^2)$ multiplications and divisions and thus deduce that the complete inversion of A by
this method requires $n^3 + O(n^2)$ operations.


42 (a) Show that the inverse of a matrix A can be found by solving the n systems

$$Ax_i = e_i \qquad i = 1, \ldots, n$$

where $e_i$ is the vector with a 1 in the ith position and zeros elsewhere.
*(b) How many operations are required to find the inverse in this manner?

(c) Show that the Gauss-Jordan reduction of A to I is equivalent to the premultiplication
of A by a sequence of elementary matrices and thus deduce that $A^{-1}$ can be found by applying
to I the same operations required to reduce A to I by the Gauss-Jordan reduction. What is the
connection of this method with that of part (a)?

43 (a) If A = LU, where L and U are as in Sec. 9.8, show that $A^{-1}L = U^{-1}$.

(b) If $U^{-1}$ has been calculated as in Prob. 41, derive an algorithm for computing $A^{-1}$
directly one column at a time.

(c) How many operations are required to calculate A~' in this manner? [Ref.: Bodewig 
(1959), pp. 214-215.] 


44 (a) Consider the two equations

$$UA^{-1} = L^{-1} \qquad A^{-1}L = U^{-1}$$

Show that these are $2n^2$ equations for the elements of $A^{-1}$, $L^{-1}$, and $U^{-1}$ but that $n^2 - n$ of the
equations involve only zero elements of $L^{-1}$ and $U^{-1}$ and n equations involve elements of $L^{-1}$
known to be 1.

(b) Thus derive $n^2$ equations for the $n^2$ elements of $A^{-1}$ which involve only known
elements of $U^{-1}$ or $L^{-1}$.

(c) Derive an algorithm for solving these equations. [Ref.: Bodewig (1959), pp. 215-216.] 
45 Indicate how the matrix inversion scheme of Sec. 9.8 can be simplified if A is symmetric.
How many operations are required for the inversion in this case?


46 (a) Let u and v be column vectors and A a nonsingular square matrix. Verify that

$$(A + uv^T)^{-1} = A^{-1} - \frac{(A^{-1}u)(v^T A^{-1})}{1 + v^T A^{-1} u}$$

(b) Let

$$B = D + \sum_{i=1}^{n} u_i v_i^T$$

where D is a nonsingular diagonal matrix. Define

$$C_k = \Bigl( \sum_{i=1}^{k} u_i v_i^T + D \Bigr)^{-1}$$

Use the result of part (a) to prove that

$$C_{k+1} = C_k - \frac{(C_k u_{k+1})(v_{k+1}^T C_k)}{1 + v_{k+1}^T C_k u_{k+1}}$$


(c) Use the result of part (b) to deduce an algorithm for calculating B~ ! with B as given in 
part (b). [Ref.: Wilf (1960), p. 73.] 

47 Let B be an n x n matrix such that b,, # 1 and let A = B — I. Define a sequence of 
matrices A“) = [a{?] such that 


AY = A 


(kK+1) — p(k) __ 
aij = ai; 


(a) Prove that af =0 ifi<k orj <k for k = 1,..., n, and thus deduce that A“*") = 0. 

(b) Thus show that B may be expanded as in the previous problem with D =I, 
ul) = alla, o = aff. 

(c) Specialize the algorithm of part (c) of the previous problem for this expansion. [Ref.: 
Wilf (1960), p. 74.] 

48 Invert the matrix of coefficients in Prob. 18a using (a) the method of Sec. 9.8; (b) the 
method of Prob. 42; (c) the method of Prob. 43; (d) the method of Prob. 44; (e) the method of 
Probs. 46 and 47. 


49 Repeat the calculations of the previous problem using the matrix of Prob. 18b. 


Section 9.9 


50 Show that if the matrix A is positive (semi-) definite, then all of its eigenvalues are 
positive (nonnegative). 

51 (a) Show that $x^T A^T A x = \|Ax\|_2^2$ and hence that $A^T A$ is positive semidefinite.

(b) Show that if the columns of A are linearly independent, then $A^T A$ is positive definite.

52 (a) Show that if P is defined by (9.9-5), then $P^2 = I$.

(b) Verify that P is symmetric and hence, using part (a), orthogonal.

53 (a) Show that for any nonsingular matrix B, the spectral condition number of $B^T B$ is
$K_2(B^T B) = [K_2(B)]^2$.

(b) Show that if $PA = (R, 0)^T$, where P is orthogonal, then $A^T A = R^T R$.

(c) Hence show that $K_2(A^T A) = [K_2(R)]^2$ if R is nonsingular.

54 Determine the number of multiplications and divisions required to solve the system 
Ax = b by the method of Householder transformations. 
55 Solve parts (a) and (b) of Prob. 19 of Chap. 6 using the method of Householder
transformations.

56 Denote by a;, and aj,, i= 1,...,n,j = 1,..., m, the elements of A, and A,,, = P, A,, 
respectively. Show that a;, can be computed as follows: 


Qi; = aj; i=l, snjj=l, ,k-—1 
i=1,...,.k-—1;j=k,...,m 
a; , = 90 i=k+1,...,n 
u=a,, %$i=kti,...,n a= Yu? 6 =sgn (a,,)(az, +a)? 
t=k+1 
+5 8B ; kk = 0 
u, =@ = --- (si = 
k k,k a+ u2 kk 
y=Bdiua; a,=a;-yu, i=k...n joktiy...,m 
i=k 


57 What are the values ofc, t, t,, s,, s, and a; of the linear programming problem defined 
by (9.9-11) and (9.9-12) for the problem defined by (9.9-14) and (9.9-15)? 

58 (a) Formulate the problem of finding the best L,, approximation to the data of 
Example 9.5 by a polynomial of degree 4 as a linear programming problem. 

(b) Repeat part (a) for the best L, approximation. 


Section 9.10 


59 (a) Show that if $x_1$ and $x_2$ are feasible solutions, then so is $\alpha x_1 + (1 - \alpha)x_2$ for any
$\alpha$ such that $0 \le \alpha \le 1$.
(b) Show that if f(x) is a linear function of x, then

$$f(\alpha x_1 + (1 - \alpha)x_2) = \alpha f(x_1) + (1 - \alpha) f(x_2)$$

(c) Use the results of parts (a) and (b) to show that any local minimum of f(x) over the set
of feasible vectors is also a global minimum.


60 (a) Let x = (x;, ..., x,)’ be feasible solution of (9.10-2) and (9.10-3) such that the 
columns of A corresponding to nonzero components of x are linearly dependent. Show that we 
can find another feasible solution x* = (xf, ..., x*)’ with at least one less nonzero component. 

(b) Use part (a) to show that if there is a feasible solution to (9.10-2) and (9.10-3), there 
exists a basic feasible solution. 

(c) Let& = (x,,...,%,)? be an optimal solution of (9.10-1) to (9.10-3) such that the columns 
of A corresponding to nonzero components of x are linearly dependent; 1.e., there exists a vector 
y =(y1,..., y,)’ such that }7_, y,a; = 0 and X, y, # 0 for some k. Show that c’y = 0. 

(d) Use parts (b) and (c) to show that if there is an optimal solution to (9.10-1) to (9.10-3), 
then there exists a basic optimal solution. 


61 Solve the following linear programming problem by the simplex method: 
Minimize $x_1 + 6x_2 - 7x_3 + x_4 + 5x_5$
subject to the constraints

$$\begin{aligned}
5x_1 - 4x_2 + 13x_3 - 2x_4 + x_5 &= 20 \\
x_1 - x_2 + 5x_3 - x_4 + x_5 &= 8 \\
x_i &\ge 0 \qquad i = 1, \ldots, 5
\end{aligned}$$
[Ref.: Dantzig (1963), pp. 108-110.] 
Section 9.11


62 Verify (9.11-3). 

63 Evaluate the determinants of the matrices of Probs. 13, 15, 18, 27, and 28. 

64 (a) Show that if A has bandwidth [w,, w,], then, even with partial pivoting, m,, = 0, 
i< Wi + 1. 

(b) Let A® be the matrix of the kth derived system in Gaussian elimination. Show that in 
the absence of pivoting, A“ and consequently all A™, k = 1,...,n — 1, have bandwidth at most 
[w,, wa. 

(c) Show that if w, + w, <n — 1, then the maximum bandwidth of A") is [w,, w. + w,] 
if we allow partial pivoting. Show further that the maximum bandwidth of all subsequent 
A™ is [w,, w2 + W,], So that U = A"~") has bandwidth [0, w, + w,). 

(d) Show that in both cases, my, = 0, i > w, + k, so that the matrix L = [m,,] has band- 
width [w,, O]. 

65 Determine the LU decomposition of a tridiagonal matrix A using (9.3-34) and (9.3-35) 
and verify (9.11-8) and that the superdiagonal of U is equal to that of A. 

66 Give an algorithm to solve a tridiagonal system by Gaussian elimination, partial 
pivoting, and implicit row equilibration which requires five vectors for storage if we allow 
overwriting. Determine the maximum number of additions, multiplications, and divisions 
required. 


67 Show that if we do forward and back substitution with band matrices L and U of 
bandwidth [w,, 0] and [0, w,], respectively, where L is unit triangular, then the total number of 
operations is 


_ Wilwy = 1) wala — 1) 


—2 
n(w, + w, ) 5 5 


additions and multiplications and n divisions. 


68 (a) Let A be an upper Hessenberg equilibrated matrix. Prove by induction that if we 
apply Gaussian elimination with partial pivoting, then |a‘"")| <r, so that the growth factor 
gan. 

(b) Let A be a tridiagonal matrix and apply Gaussian elimination with pivoting. Show 
that g < 2. 

(c) Let A be a band matrix of bandwidth [k, k] and apply Gaussian elimination without 
pivoting. Show that g < 2". 

69 Assume that we carry out the process of Gaussian elimination with partial pivoting 
from bottom to top and from right to left rather than the usual way of top to bottom and left to 
right. 

(a) Show that we arrive at the decomposition PA = UL, where U is a unit upper triangular 
matrix and L is a lower triangular matrix. 

(b) Give the formulas for the elements of U and L in the case where P is the unit matrix,
i.e., no pivoting.

(c) Show that with no pivoting, if A is a band matrix of bandwidth [w,, w.], then U and L 
have bandwidths [0, w,] and [w,, O], respectively. 

(d) Show that with partial pivoting U still has bandwidth [0, w.] while L has bandwidth 
[0, min (w, + w,, n — 1)]. Hence, for lower Hessenberg matrices, we have a decomposition 
requiring only $O(n^2)$ multiplications.


CHAPTER 


TEN 


THE CALCULATION OF 
EIGENVALUES AND EIGENVECTORS 
OF MATRICES 


10.1 BASIC RELATIONSHIPS 


Let A be a square matrix of order n. Its eigenvalues $\lambda_1, \ldots, \lambda_n$ are the
solutions of the determinantal equation, called the characteristic equation†

$$|A - \lambda I| = 0 \tag{10.1-1}$$

Since $|A^T - \lambda I| = |A - \lambda I|$, it follows that the set of eigenvalues of $A^T$ is
identical to that of A. Corresponding to each distinct eigenvalue $\lambda_i$ there
exists at least one nontrivial solution (determined to within a multiplicative
constant) of the system of linear equations

$$Ax = \lambda_i x \tag{10.1-2}$$

This solution $x_i^T = (x_1^{(i)}, x_2^{(i)}, \ldots, x_n^{(i)})$ is a right eigenvector of A. (In what
follows the term eigenvector will refer exclusively to right eigenvectors.) A
left eigenvector corresponding to $\lambda_i$ is a nontrivial solution of

$$y^T A = \lambda_i y^T \tag{10.1-3}$$


† Since the great majority of computational problems involve real matrices, we shall assume
for simplicity throughout this chapter that A is real. As will generally be clear from the context, 
many of the theorems we shall consider are also true when A is complex or, if the real matrix is 
assumed symmetric, when A is Hermitian. Of course, even for real matrices, eigenvalues and 
eigenvectors may be complex. However, the eigenvectors corresponding to real eigenvalues are 
real. 
Therefore, a left eigenvector of A is an eigenvector of $A^T$. It is easily shown
{1} that if $y_i$ and $x_j$ are, respectively, left and right eigenvectors corresponding
to distinct eigenvalues, then $y_i$ and $x_j$ are orthogonal. In this chapter we
shall be interested in methods for the calculation of some or all of the 
eigenvalues and some or all of the eigenvectors of A. First, however, we shall 
review some of the basic theorems and relationships which are necessary to 
understand the development that follows. 


10.1-1 Basic Theorems 


In this section we list four basic theorems concerning the eigenvalues and 
eigenvectors of a matrix with which the reader should be familiar. Proofs of 
three of them are left to a problem {2}. 


Theorem 10.1 If $\lambda_1, \lambda_2, \ldots, \lambda_n$ are the eigenvalues of A, then the
eigenvalues of $A^k$ are $\lambda_1^k, \lambda_2^k, \ldots, \lambda_n^k$. More generally, if p(x) is a polynomial,
the eigenvalues of p(A) are $p(\lambda_1), \ldots, p(\lambda_n)$.


Theorem 10.2 If A is real and symmetric, all eigenvalues and eigenvectors
are real. Moreover, eigenvectors corresponding to distinct
eigenvalues are orthogonal, and the left eigenvector corresponding to
the eigenvector $x_i$ is $x_i^T$.


Theorem 10.3 Any similarity transformation $PAP^{-1}$ applied to A
leaves the eigenvalues of the matrix unchanged.


PROOF Let $\lambda$ be an eigenvalue of A and x the associated eigenvector.
Then

$$Ax = \lambda x$$

so that

$$PAx = \lambda Px \tag{10.1-4}$$

Let y = Px, so that $x = P^{-1}y$. Substituting in (10.1-4), we get

$$PAP^{-1}y = \lambda y \tag{10.1-5}$$

Thus $\lambda$ is an eigenvalue of $PAP^{-1}$, and y is the associated eigenvector.
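
As a small numerical check of Theorem 10.3, the following Python sketch forms $PAP^{-1}$ for a 2 x 2 example in exact rational arithmetic and compares the trace and determinant, which together fix the characteristic polynomial of a 2 x 2 matrix. The matrices A and P and all helper names are illustrative assumptions, not taken from the text.

```python
from fractions import Fraction

def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def inverse_2x2(P):
    """Inverse of a nonsingular 2 x 2 matrix by the adjugate formula."""
    det = P[0][0] * P[1][1] - P[0][1] * P[1][0]
    return [[P[1][1] / det, -P[0][1] / det],
            [-P[1][0] / det, P[0][0] / det]]

def trace_and_det(M):
    """For a 2 x 2 matrix the characteristic equation is
    lambda^2 - trace * lambda + det = 0."""
    return (M[0][0] + M[1][1], M[0][0] * M[1][1] - M[0][1] * M[1][0])

# Hypothetical example matrices (P must be nonsingular).
A = [[Fraction(2), Fraction(1)], [Fraction(1), Fraction(3)]]
P = [[Fraction(1), Fraction(2)], [Fraction(3), Fraction(5)]]

S = matmul(matmul(P, A), inverse_2x2(P))  # the similar matrix P A P^{-1}
print(trace_and_det(A), trace_and_det(S))
```

Because both matrices have the same characteristic equation, they have the same eigenvalues, as the theorem asserts.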


Theorem 10.4 (Cayley-Hamilton) Let

$$f(\lambda) = |A - \lambda I| = 0 \tag{10.1-6}$$

be the characteristic equation of A. Then f(A) = 0.
10.1-2 The Characteristic Equation


One method of determining the eigenvalues of A is to find the roots of the 
polynomial equation (10.1-1). For this purpose the methods of Chap. 8 can 
be used. But before we can use most of these methods to find the roots of 
(10.1-1) we must determine the coefficients of the characteristic equation 
itself. Direct calculation of these coefficients from the definition of the deter- 
minant is never recommended except for very low-order matrices, since it 
involves an astronomical amount of calculation. Here we present two
methods for finding the coefficients of the powers of $\lambda$ in the characteristic
equation without direct evaluation of the determinant.


Krylov's method We write the characteristic equation as

$$f(\lambda) = \lambda^n + \sum_{i=0}^{n-1} b_i \lambda^i = 0 \tag{10.1-7}$$

From the Cayley-Hamilton theorem we have

$$A^n + \sum_{i=0}^{n-1} b_i A^i = 0 \tag{10.1-8}$$

Then, for any vector y,

$$A^n y + \sum_{i=0}^{n-1} b_i A^i y = 0 \tag{10.1-9}$$

Equation (10.1-9) is a system of n linear equations in the n unknowns $b_0, \ldots, b_{n-1}$
which can be solved by any of the methods of the previous chapter.
Note that the calculation of $A^i y = A(A^{i-1} y)$ requires $n^2$ multiplications so
that about $n^3$ multiplications are required to establish (10.1-9) followed by
about $\frac{1}{3}n^3$ to solve (10.1-9) by one of the methods of Sec. 9.3.
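
The steps of Krylov's method can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: exact rational arithmetic is used so that rounding does not obscure the result, the 3 x 3 test matrix and the starting vector y are made up for the example, and the helper names (`matvec`, `solve`, `krylov_coefficients`) are hypothetical. The sketch also assumes that the vectors $y, Ay, \ldots, A^{n-1}y$ are linearly independent, which is not true for every choice of y.

```python
from fractions import Fraction

def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def solve(M, rhs):
    """Gaussian elimination with a nonzero-pivot search, then back substitution.
    Raises StopIteration if M is singular."""
    n = len(M)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for j in range(c, n + 1):
                aug[r][j] -= f * aug[c][j]
    x = [Fraction(0)] * n
    for i in reversed(range(n)):
        s = aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / aug[i][i]
    return x

def krylov_coefficients(A, y):
    """Return [b_0, ..., b_{n-1}] satisfying A^n y + sum_i b_i A^i y = 0."""
    n = len(A)
    A = [[Fraction(a) for a in row] for row in A]
    vecs = [[Fraction(c) for c in y]]
    for _ in range(n):                      # build y, Ay, ..., A^n y
        vecs.append(matvec(A, vecs[-1]))
    K = [[vecs[i][r] for i in range(n)] for r in range(n)]  # column i is A^i y
    rhs = [-c for c in vecs[n]]
    return solve(K, rhs)

A = [[2, 1, 0], [1, 3, 1], [0, 1, 4]]       # illustrative matrix
b = krylov_coefficients(A, [1, 0, 0])
print(b)  # coefficients b_0, b_1, b_2
```

For this matrix the characteristic equation in the form (10.1-7) is $\lambda^3 - 9\lambda^2 + 24\lambda - 18 = 0$, which can be confirmed by expanding the determinant by hand.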


LeVerrier’s method Using the property that the sum of the eigenvalues of 
any matrix is equal to the trace of the matrix, and using Theorem 10.1, we 
can write 


$$\sum_{i=1}^{n} \lambda_i^k = t_k \qquad k = 1, \ldots, n \tag{10.1-10}$$


where $t_k$ is the trace of $A^k$. In Sec. 8.10-3 we noted that the sums of the
powers of the roots of a polynomial could easily be computed using the
coefficients of the polynomial. This process is easily reversed; the coefficients
of the polynomial are easily computed if the sums of the powers of its roots
are known {3}. Thus, using (10.1-10), we can compute the coefficients of
the characteristic equation. But LeVerrier's method is much inferior to
Krylov's because of the necessity of actually computing $A^k$, k = 1, ..., n.
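
A compact sketch of LeVerrier's method follows. It assumes the standard Newton identities as the way of recovering coefficients from power sums (the text defers this step to a problem), writes the polynomial as $\lambda^n + c_1\lambda^{n-1} + \cdots + c_n$ internally, and then returns the coefficients in the $b_i$ ordering of (10.1-7). The function names and the test matrix are illustrative assumptions.

```python
from fractions import Fraction

def matmul(X, Y):
    """Product of two square matrices stored as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def leverrier_coefficients(A):
    """Return [b_0, ..., b_{n-1}] of lambda^n + sum_i b_i lambda^i."""
    n = len(A)
    A = [[Fraction(a) for a in row] for row in A]
    power = [row[:] for row in A]
    t = [None]                               # t[k] = trace(A^k), 1-based
    for k in range(1, n + 1):
        t.append(sum(power[i][i] for i in range(n)))
        if k < n:
            power = matmul(power, A)         # the costly step noted in the text
    c = [Fraction(1)]                        # c[k] multiplies lambda^(n-k)
    for k in range(1, n + 1):
        # Newton identity: t_k + c_1 t_{k-1} + ... + c_{k-1} t_1 + k c_k = 0
        s = t[k] + sum(c[j] * t[k - j] for j in range(1, k))
        c.append(-s / k)
    return [c[n - i] for i in range(n)]      # reorder so index i matches lambda^i

A = [[2, 1, 0], [1, 3, 1], [0, 1, 4]]        # illustrative matrix
coeffs = leverrier_coefficients(A)
print(coeffs)
```

The explicit matrix powers formed in the first loop are exactly the work that makes this approach more expensive than Krylov's method.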
In a problem {4} we consider another technique of computing the
coefficients of the characteristic polynomial. In general, direct computation 
of these coefficients is less efficient than the methods which will be discussed 
later in this chapter. Moreover, it may convert a well-conditioned problem 
into an ill-conditioned one, since, as we have seen in Sec. 8.13, small errors in 
the coefficients of a polynomial can induce large errors in some of its roots. 
Thus, direct computation of the coefficients of the characteristic polynomial 
is not recommended as a method for computing eigenvalues and is only of 
historical interest. 


10.1-3 The Location of, and Bounds on, the Eigenvalues 


In this section we consider some of the more important among the many 
theorems which deal with the location of the eigenvalues of a matrix, i.e., the 
location of the zeros of the characteristic polynomial. These theorems can be 
used, for example, to estimate the magnitude of the largest and smallest
eigenvalues in magnitude and thus to estimate $\rho(A)$ and the condition of A.
Such estimates can also be used to generate initial approximations to be 
used in iterative methods for determining eigenvalues (see Sec. 10.2). 


Gerschgorin's theorem Let $A = [a_{ij}]$ and let $C_i$, i = 1, ..., n, be the circles
with centers $a_{ii}$ and radii

$$r_i = \sum_{\substack{k=1 \\ k \ne i}}^{n} |a_{ik}| \qquad i = 1, \ldots, n \tag{10.1-11}$$

Further let

$$D = \bigcup_{i=1}^{n} C_i \tag{10.1-12}$$

Then we can state the following theorem.


Theorem 10.5 (Gerschgorin)† All the eigenvalues of A lie within the
domain D.


PROOF We use the result of Prob. 30 of Chap. 9 that if the component of
largest magnitude of the solution of Bx = 0 is the ith component, then

$$|b_{ii}| \le \sum_{k \ne i} |b_{ik}| \tag{10.1-13}$$


† This theorem is also valid when the elements of A are complex.

Letting $B = A - \lambda I$, where $\lambda$ is an eigenvalue of A, we have

$$|\lambda - a_{ii}| \le \sum_{\substack{k=1 \\ k \ne i}}^{n} |a_{ik}| \tag{10.1-14}$$

Equation (10.1-14) holds for any eigenvalue (although i may vary from
one eigenvalue to another), and this is sufficient to prove the theorem.
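
To make the theorem concrete, the sketch below computes the Gerschgorin circles of a small matrix and checks that both eigenvalues fall in their union. The 2 x 2 matrix is a made-up example chosen so that its eigenvalues are real and available from the quadratic formula; the function names are hypothetical.

```python
import math

def gerschgorin_discs(A):
    """Return a list of (center, radius) pairs, one per row of A."""
    discs = []
    for i, row in enumerate(A):
        radius = sum(abs(a) for k, a in enumerate(row) if k != i)
        discs.append((row[i], radius))
    return discs

def in_union(z, discs):
    """True if z lies in at least one of the circles."""
    return any(abs(z - c) <= r for c, r in discs)

A = [[4.0, 1.0], [2.0, 1.0]]                 # illustrative matrix
discs = gerschgorin_discs(A)

# Eigenvalues of a 2 x 2 matrix from lambda^2 - tr*lambda + det = 0
# (real here because tr^2 - 4*det > 0 for this example).
tr = A[0][0] + A[1][1]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
root = math.sqrt(tr * tr - 4 * det)
eigs = [(tr + root) / 2, (tr - root) / 2]

print(discs)
print([in_union(lam, discs) for lam in eigs])
```

The theorem only guarantees membership in the union D; a particular circle need not contain an eigenvalue.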


As a consequence of Theorem 10.5 we get one of the results of 
Prob. 30b of Chap. 9. 


Corollary 10.1 The spectral radius of $A^{-1}$ is such that

$$\frac{1}{\rho(A^{-1})} \ge \min_i \Bigl( |a_{ii}| - \sum_{k \ne i} |a_{ik}| \Bigr) \tag{10.1-15}$$



By making use of the result of Theorem 10.3, it may be possible to 
improve upon the bound for the eigenvalues given by Gerschgorin’s theorem 
by first applying a similarity transformation to A {5}. 


Other theorems A number of theorems give information about the 
eigenvalues of a matrix by using related norms and analogous quantities. 
The following are theorems of this type. 


Theorem 10.6 Let A be symmetric and positive definite. Then

$$\rho(A) = \max_{x} \frac{x^* A x}{x^* x} \tag{10.1-16}$$

$$\frac{1}{\rho(A^{-1})} = \min_{x} \frac{x^* A x}{x^* x} \tag{10.1-17}$$

where x is an arbitrary real or complex nonzero vector, and, as it will
throughout this chapter, the superscript * denotes the conjugate
transpose.


PROOF Since A is symmetric and positive definite, all its eigenvalues are
real and positive and can be ordered as follows:

$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n > 0 \tag{10.1-18}$$

so that $\rho(A) = \lambda_1$. Now for any nonzero vector x,

$$\lambda_1 - \frac{x^* A x}{x^* x} = \frac{x^*(\lambda_1 I - A)x}{x^* x} \ge 0 \tag{10.1-19}$$

since all the eigenvalues of $\lambda_1 I - A$ are nonnegative, so that $\lambda_1 I - A$ is
positive semidefinite. On the other hand,

$$\lambda_1 = \frac{x_1^* A x_1}{x_1^* x_1} \tag{10.1-20}$$

where $x_1$ is the eigenvector corresponding to $\lambda_1$, proving (10.1-16).
Similarly, we can show that

$$\lambda_n = \min_{x} \frac{x^* A x}{x^* x} \tag{10.1-21}$$

Since the eigenvalues of $A^{-1}$ are $\lambda_1^{-1}, \ldots, \lambda_n^{-1}$, so that $\rho(A^{-1}) = \lambda_n^{-1}$,
(10.1-17) holds.


From this proof, it is readily seen that if A is symmetric but not positive
definite, the theorem is still true if $\rho(A)$ is replaced by $\lambda_{\max}$ and $1/\rho(A^{-1})$ is
replaced by $\lambda_{\min}$, where $\lambda_{\min} \le \lambda_i \le \lambda_{\max}$ for any eigenvalue $\lambda_i$.
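
The bounds of Theorem 10.6 are easy to observe numerically. The sketch below evaluates the quotient $x^*Ax / x^*x$ for a handful of real vectors (so that $x^*$ is just the transpose) and compares the values with the extreme eigenvalues of a symmetric positive definite 2 x 2 matrix. The matrix, the sample vectors, and the function name are assumptions made for illustration; the eigenvalues $(5 \pm \sqrt{5})/2$ come from the quadratic formula for this particular matrix.

```python
import math

def rayleigh_quotient(A, x):
    """Return (x^T A x) / (x^T x) for a real matrix A and nonzero real vector x."""
    Ax = [sum(a * xj for a, xj in zip(row, x)) for row in A]
    return sum(xi * yi for xi, yi in zip(x, Ax)) / sum(xi * xi for xi in x)

A = [[2.0, 1.0], [1.0, 3.0]]                 # symmetric positive definite example
lam_max = (5 + math.sqrt(5)) / 2             # roots of lambda^2 - 5*lambda + 5 = 0
lam_min = (5 - math.sqrt(5)) / 2

samples = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0], [3.0, -2.0]]
quotients = [rayleigh_quotient(A, x) for x in samples]

# An eigenvector for lam_max solves (2 - lam_max) x_1 + x_2 = 0.
x1 = [1.0, lam_max - 2.0]

print(quotients)
print(rayleigh_quotient(A, x1), lam_max)
```

Every sampled quotient lies between the smallest and largest eigenvalue, and the maximum is attained at the eigenvector $x_1$, as in (10.1-20).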


Theorem 10.7 For an arbitrary nonsingular matrix A

$$\frac{1}{\rho[(A^T A)^{-1}]} \le |\lambda_i|^2 \le \rho(A^T A) \tag{10.1-22}$$

where $\lambda_i$ is any eigenvalue of A.


PROOF Let $x_i$ be the eigenvector corresponding to $\lambda_i$. Then

$$A x_i = \lambda_i x_i \tag{10.1-23}$$

and

$$x_i^* A^T = \bar{\lambda}_i x_i^* \tag{10.1-24}$$

Therefore,

$$x_i^* A^T A x_i = |\lambda_i|^2 x_i^* x_i \tag{10.1-25}$$

Using Theorem 10.6 we have, since $A^T A$ is positive definite,

$$\rho(A^T A) \ge \frac{x_i^* A^T A x_i}{x_i^* x_i} \ge \frac{1}{\rho[(A^T A)^{-1}]} \tag{10.1-26}$$

which with (10.1-25) proves the theorem.


If A is singular, the theorem is true if the left-hand side of the
inequality (10.1-22) is replaced by 0.


10.1-4 Canonical Forms 


One of the most important techniques in the calculation of eigenvalues and 
eigenvectors of matrices is to transform the given matrix into some canoni- 
cal form. Here we consider some of the theorems dealing with these canoni- 
cal forms. In particular we shall indicate the basic differences that occur in 
considering symmetric and nonsymmetric matrices since they have impor- 
tant implications on what follows in this chapter. 
Theorem 10.8 Eigenvectors of an arbitrary matrix A corresponding to
distinct eigenvalues are linearly independent.


PROOF By induction. Let $x_1$ and $x_2$ be eigenvectors corresponding to $\lambda_1$
and $\lambda_2$, $\lambda_1 \ne \lambda_2$. Then if $a_1 x_1 + a_2 x_2 = 0$, we also have

$$a_1 \lambda_1 x_1 + a_2 \lambda_2 x_2 = 0$$

which implies $a_1 = a_2 = 0$ (why?). Now suppose $x_1, \ldots, x_k$ corresponding
to distinct eigenvalues $\lambda_1, \ldots, \lambda_k$ are independent. Then if $x_{k+1}$
corresponds to $\lambda_{k+1}$, consider

$$a_1 x_1 + \cdots + a_{k+1} x_{k+1} = 0 \tag{10.1-27}$$

Premultiplying by A, we have

$$a_1 \lambda_1 x_1 + \cdots + a_{k+1} \lambda_{k+1} x_{k+1} = 0 \tag{10.1-28}$$

If $\lambda_{k+1} = 0$, it follows immediately from (10.1-28) and the induction
hypothesis that $a_1 = \cdots = a_k = 0$, and (10.1-27) then gives $a_{k+1} = 0$.
If $\lambda_{k+1} \ne 0$, divide (10.1-28) by $\lambda_{k+1}$ and subtract
from (10.1-27) to obtain

$$a_1 \Bigl( 1 - \frac{\lambda_1}{\lambda_{k+1}} \Bigr) x_1 + \cdots + a_k \Bigl( 1 - \frac{\lambda_k}{\lambda_{k+1}} \Bigr) x_k = 0 \tag{10.1-29}$$

Since $\lambda_1, \ldots, \lambda_{k+1}$ are all distinct, the induction hypothesis again gives
$a_1 = a_2 = \cdots = a_k = 0$, and (10.1-27) then gives $a_{k+1} = 0$. This proves
the theorem.


Theorem 10.9 Let A be an arbitrary matrix whose eigenvalues are all 
distinct. Then there exists a similarity transformation such that 


$$P^{-1}AP = D$$


where D is a diagonal matrix whose diagonal elements are the 
eigenvalues of A. 


PROOF Let P be the matrix whose columns are the (right) eigenvectors 
of A. Since the eigenvalues are distinct, P exists (but may be complex). 
Then 


$$AP = PD \tag{10.1-30}$$

where D is a diagonal matrix with the eigenvalues on the diagonal. Since
the eigenvectors are linearly independent by Theorem 10.8, $P^{-1}$ exists
and

$$P^{-1}AP = D \tag{10.1-31}$$

which proves the theorem. Note that the rows of $P^{-1}$ are the left
eigenvectors of A.
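
The construction in the proof can be followed on a small example. In the sketch below the matrix A and its eigenvectors (1, 1) and (1, -2), for the eigenvalues 5 and 2, are chosen by hand for illustration; the code forms P from those eigenvectors, confirms that $P^{-1}AP$ is diagonal, and checks that each row of $P^{-1}$ is a left eigenvector. All names are hypothetical.

```python
from fractions import Fraction as F

def matmul(X, Y):
    """Product of two (possibly rectangular) matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

A = [[F(4), F(1)], [F(2), F(3)]]             # eigenvalues 5 and 2
P = [[F(1), F(1)], [F(1), F(-2)]]            # columns are right eigenvectors

det = P[0][0] * P[1][1] - P[0][1] * P[1][0]
P_inv = [[P[1][1] / det, -P[0][1] / det],
         [-P[1][0] / det, P[0][0] / det]]

D = matmul(matmul(P_inv, A), P)              # should be diag(5, 2)

# Each row y^T of P_inv should satisfy y^T A = lambda y^T.
left_products = [matmul([row], A)[0] for row in P_inv]

print(D)
print(left_products)
```

Here the first row of $P^{-1}$ is scaled by 5 and the second by 2 when multiplied on the right by A, matching (10.1-3).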
The computational disadvantage of Theorem 10.9 is the need to find a
matrix P and its inverse. It would be more convenient if we could diagonalize
a matrix using an orthogonal or unitary transformation. The most general
theorem we can prove in this case is the following.


Theorem 10.10 For an arbitrary matrix A there exists a unitary
transformation Q such that

$$Q^* A Q = T \tag{10.1-32}$$

where T is triangular (but may have complex elements).
Note that this does not require that A have distinct eigenvalues, as
in Theorem 10.9.


PROOF The proof is by induction on the order n of the matrix. If n = 1,
the theorem is true since a matrix of order 1 is triangular. Suppose the
theorem is true for matrices of order n - 1 and let A be a matrix of order
n. Let $u_1$ be an eigenvector of A of magnitude 1 corresponding to any
eigenvalue, say $\lambda_1$. Let

$$u_1, v_1, \ldots, v_{n-1} \tag{10.1-33}$$

be an orthonormal set of vectors.† If $Q_1$ is the matrix whose columns are
$u_1, v_1, \ldots, v_{n-1}$, we have

$$A_1 = Q_1^* A Q_1 = \begin{bmatrix} \lambda_1 & w^T \\ 0 & B \end{bmatrix} \tag{10.1-34}$$

By the induction hypothesis there exists a unitary matrix P of order
n - 1 such that

$$P^* B P = T_{n-1} \tag{10.1-35}$$

Now let

$$Q_2 = \begin{bmatrix} 1 & 0^T \\ 0 & P \end{bmatrix} \tag{10.1-36}$$

so that $Q_2$ is unitary of order n. Then from (10.1-34) to (10.1-36)

$$Q_2^* Q_1^* A Q_1 Q_2 = Q_2^* \begin{bmatrix} \lambda_1 & w^T \\ 0 & B \end{bmatrix} Q_2 = \begin{bmatrix} \lambda_1 & w^T P \\ 0 & T_{n-1} \end{bmatrix} = T_n \tag{10.1-37}$$

where $T_n$ is triangular of order n. Setting $Q = Q_1 Q_2$ proves the theorem.


The proof of Theorem 10.10 implies that, if all the eigenvalues of A are 
real, then Q is a real orthogonal matrix (why?). In particular, since the 
eigenvalues of a symmetric matrix are real and since the orthogonal trans- 
formation of a symmetric matrix is symmetric, we have the following 
corollary. 


† That is, $u_1^* v_i = 0$, i = 1, ..., n - 1, and $v_i^* v_j = \delta_{ij}$, i, j = 1, ..., n - 1.




Corollary 10.2 If A is symmetric, then there exists an orthogonal matrix
Q such that

$$Q^T A Q = D \tag{10.1-38}$$


with D as in Theorem 10.9. 


For computational purposes this corollary is an improvement over 
Theorem 10.9 not only in that the similarity transformation is replaced by an 
orthogonal one but also in that there is no requirement that the eigenvalues 
be distinct. Since it follows from (10.1-38) and the orthogonality of Q that 


AQ = QD (10.1-39) 


we have the result that the columns of Q are the eigenvectors of A. 

Since a triangular matrix T yields its eigenvalues just as easily as a 
diagonal matrix, the reader may think that Theorem 10.10, except for the 
necessity of using complex arithmetic, will be as useful in practice as Corol- 
lary 10.2. But in practice it is much easier to diagonalize symmetric matrices 
than it is to triangularize nonsymmetric matrices, as we shall see in 
Secs. 10.3 and 10.4. 

Theorem 10.9 and Corollary 10.2 take care of the diagonalization of all 
but nonsymmetric matrices with multiple eigenvalues. The theorem covering 
this case is 


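Theorem 10.11 Given an arbitrary matrix A, there exists a nonsingular
matrix P, whose elements may be complex, such that

$$P^{-1}AP = \begin{bmatrix} J_1 & & & \\ & J_2 & & \\ & & \ddots & \\ & & & J_K \end{bmatrix} \tag{10.1-40}$$

where $J_k$, k = 1, ..., K, with $K \le n$, is a matrix with an eigenvalue $\lambda_i$ of A on its
main diagonal and 1s on the diagonal above the main diagonal.
Thus

$$J_k = \begin{bmatrix} \lambda_i & 1 & & \\ & \lambda_i & \ddots & \\ & & \ddots & 1 \\ & & & \lambda_i \end{bmatrix} \tag{10.1-41}$$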
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Note that a given eigenvalue may appear as the diagonal element of more 
than one J,. The matrix in (10.1-40) is called the Jordan canonical form of A. 
The determinants 

det (J, — AI) = (A; — A)" (10.1-42) 


where v, is the order of J,, are called the elementary divisors of A. If v, = 1, 
we say the elerhnentary divisor is linear. The proof of Theorem 10.11 is 
beyond the scope of this book [see, for example, Bodewig (1959), pp. 82-88]. 

It follows from Theorem 10.11 that, if the eigenvalues are distinct, all the 
elementary divisors are linear, which gives us another proof of Theorem 
10.9. Theorem 10.11 together with Corollary 10.2 implies that for a symmet- 
ric matrix the elementary divisors are linear whether or not the eigenvalues 
are distinct. More important is the result that if the elementary divisors are 
linear, then corresponding to an eigenvalue of multiplicity k there are k 
linearly independent eigenvectors. Therefore, Theorem 10.9 can be gener- 
alized to include all matrices with linear elementary divisors. If, however, 
there are nonlinear elementary divisors, then Theorem 10.11 implies that 
there are eigenvalues whose multiplicity is greater than the number of 
independent eigenvectors corresponding to them. The simplicity of the result 
of Corollary 10.2 in comparison with the results of Theorems 10.10 and 
10.11 is an indication that the calculation of the eigenvalues and eigenvec- 
tors of symmetric matrices will cause fewer difficulties than that of nonsym- 
metric matrices. The remainder of this chapter will bear out this indication. 
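
As a small illustration of a nonlinear elementary divisor (an example chosen
here for concreteness, not taken from the text), consider a single Jordan
block of order 2 with eigenvalue $\lambda_1$:

$$J = \begin{bmatrix} \lambda_1 & 1 \\ 0 & \lambda_1 \end{bmatrix} \qquad \det (J - \lambda I) = (\lambda_1 - \lambda)^2$$

$$(J - \lambda_1 I)x = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} x_2 \\ 0 \end{bmatrix} = 0 \quad\Longrightarrow\quad x_2 = 0$$

Every eigenvector is therefore a multiple of $(1, 0)^T$: the eigenvalue
$\lambda_1$ has multiplicity 2 but only one independent eigenvector, and this
$J$ cannot be diagonalized by any similarity transformation.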

With the theorems of this section as background, we are now ready to 
consider methods for the calculation of eigenvalues and eigenvectors. We 
begin with methods for the calculation of the largest eigenvalue in 
magnitude. 


10.2 THE LARGEST EIGENVALUE IN MAGNITUDE 
BY THE POWER METHOD 


The basis of our techniques for determining the largest eigenvalue in magni- 
tude is very similar to Bernoulli’s method (Sec. 8.10-3) for determining the 
zeros of polynomials. And this, of course, should not be surprising because 
of the similarity between the problems of finding the eigenvalues of a matrix 
and the zeros of a polynomial. 

The basic assumption of this section is that all the elementary divisors of
A are linear. However, the method of this section is often applicable even
when A has nonlinear elementary divisors {9}. The assumption implies, as
noted in the previous section, that there are n linearly independent
eigenvectors of A and thus that the eigenvectors span n space. Therefore, any
vector $v_0$ can be expressed as a linear combination

$$v_0 = \sum_{i=1}^{n} \alpha_i x_i \tag{10.2-1}$$

where $x_i$, $i = 1, \ldots, n$, are the eigenvectors of A. If $\lambda_i$ is the eigenvalue
corresponding to $x_i$, then

$$A v_0 = \sum_{i=1}^{n} \alpha_i \lambda_i x_i \tag{10.2-2}$$

and in general the mth iterated vector is given by

$$v_m = A^m v_0 = \sum_{i=1}^{n} \alpha_i \lambda_i^m x_i \tag{10.2-3}$$

Now we order the eigenvalues so that

$$|\lambda_1| \ge |\lambda_2| \ge \cdots \ge |\lambda_n| \tag{10.2-4}$$

If then, in particular, $\lambda_1$ is dominant, that is, $|\lambda_1| > |\lambda_2|$, and therefore real,
we have the following theorem.


Theorem 10.12 (Von Mises) If the matrix A has n linearly independent
eigenvectors, and if the largest eigenvalue in magnitude $\lambda_1$ is dominant,
then if $v_0$ has a component in the direction of $x_1$,

$$\lim_{m \to \infty} \frac{1}{\lambda_1^m} A^m v_0 = \alpha_1 x_1 \tag{10.2-5}$$

The proof follows immediately from (10.2-3). The requirement that $v_0$ have a
component in the direction of $x_1$ is assurance that $\alpha_1 \ne 0$.

As a consequence of this theorem we see that if y is any vector not
orthogonal to $x_1$, then, using (10.2-5),

$$\lambda_1 = \lim_{m \to \infty} \frac{y^T v_{m+1}}{y^T v_m} \tag{10.2-6}$$

The numbers $y^T v_{m+1} = y^T A v_m$ are called Schwarz constants. A convenient
choice of y in practice is the vector with a component of 1 in the position
corresponding to the maximum component of $v_m$ and 0 elsewhere. This
minimizes the computation in (10.2-6). Of course, early in the computation
the largest component of $v_m$ may vary, but ultimately it will be the largest
component of the eigenvector. In practice, then, we compute successive
approximations to $\lambda_1$ as the ratio of the largest component of successive $v_m$'s.


Example 10.1 Find the dominant eigenvalue of the matrix

$$A = \begin{bmatrix} 1.0 & 1.0 & .5 \\ 1.0 & 1.0 & .25 \\ .5 & .25 & 2.0 \end{bmatrix}$$

using $v_0^T = (1, 1, 1)$.

In the table below each $v_m$ was calculated using (10.2-3) and then "normalized" to
make its largest component 1 before doing the next iteration. The quantity $\lambda$ is the
normalizing factor and thus represents the ratio of the largest components of $v_m$ and
$v_{m-1}$; for example, the first entry in the $\lambda$ column is the ratio of the third component
of the unnormalized $v_1$ to the third component of $v_0$.

     m    v_m^T (normalized)           lambda
     1    (.9091, .8182, 1.0000)     2.7500000
     5    (.7651, .6674, 1.0000)     2.5587918
    10    (.7494, .6508, 1.0000)     2.5380029
    11    (.7489, .6504, 1.0000)     2.5373873
    12    (.7486, .6501, 1.0000)     2.5370284
    13    (.7484, .6499, 1.0000)     2.5368188
    14    (.7484, .6498, 1.0000)     2.5366969
    15    (.7483, .6497, 1.0000)     2.5366256
    16    (.7483, .6497, 1.0000)     2.5365840
    17    (.7482, .6497, 1.0000)     2.5365598
    18    (.7482, .6497, 1.0000)     2.5365456
    19    (.7482, .6497, 1.0000)     2.5365374
    20    (.7482, .6497, 1.0000)     2.5365323

The calculations were performed on a digital computer using floating-point arithmetic
with an 8-digit fractional part. The values of $v_m$ are rounded values. To eight figures the
true value of $\lambda_1$ is 2.5365258 and that of $x_1$ is (.74822116, .64966116, 1.00000000). These
values were reached in 28 iterations. Although a symmetric matrix was used in this
example, remember that the power method requires only that the elementary divisors be
linear.
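
The iteration of Example 10.1 can be sketched in a few lines of Python. This
is a minimal illustration, not the program used for the table: the helper
names `mat_vec` and `power_method` are invented here, and the arithmetic is
ordinary double precision rather than the 8-digit arithmetic of the example,
so the later digits of intermediate iterates may differ slightly from the table.

```python
def mat_vec(a, v):
    """Return the product of the matrix a (a list of rows) and the vector v."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]


def power_method(a, v0, iterations):
    """Power method with each iterate scaled so its largest component is 1.

    Returns the last normalizing factor (the current eigenvalue estimate)
    and the last normalized vector.
    """
    v = list(v0)
    lam = None
    for _ in range(iterations):
        w = mat_vec(a, v)
        # The component of greatest magnitude is the normalizing factor.
        lam = max(w, key=abs)
        v = [w_i / lam for w_i in w]
    return lam, v


A = [[1.0, 1.0, 0.5],
     [1.0, 1.0, 0.25],
     [0.5, 0.25, 2.0]]

lam1, v1 = power_method(A, [1.0, 1.0, 1.0], 1)     # first row of the table
lam60, v60 = power_method(A, [1.0, 1.0, 1.0], 60)  # well past convergence
print(lam1, v1)
print(lam60, v60)
```

Normalizing by the largest component, as the example does, keeps the iterates
from overflowing or underflowing and yields the eigenvalue estimate as a
by-product of the scaling.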


From (10.2-3)

$$v_m = \lambda_1^m \left[ \alpha_1 x_1 + \sum_{i=2}^{n} \left( \frac{\lambda_i}{\lambda_1} \right)^m \alpha_i x_i \right] \tag{10.2-7}$$

Therefore, the rate of convergence of the power method depends upon how
fast the ratios $(\lambda_i/\lambda_1)^m$ go to zero; in particular this rate depends upon the
ratio $|\lambda_2|/|\lambda_1|$. The number of iterations required to get a desired degree of
convergence depends upon both the rate of convergence and on how large $\alpha_1$
is compared with the other $\alpha_i$, the latter depending in turn on the choice of
$v_0$. In Example 10.1 the convergence was quite slow. In later examples we
shall see how these two factors, $|\lambda_2/\lambda_1|$ and $\alpha_1$, affect the convergence.
Later in this section we shall also consider means of speeding up the
convergence.

If $v_0$ has no component in the $x_1$ direction and $|\lambda_2| > |\lambda_3|$, then the
iteration should converge to $\lambda_2$ and $x_2$ (why?). In practical computation,
however, roundoff error will generally introduce a component in the $x_1$
direction so that eventually the iteration will converge to $\lambda_1$ and $x_1$.
Nevertheless a good approximation to $\lambda_2$ and $x_2$ can be obtained before the term
in $\lambda_1$ dominates.


Example 10.2 Apply the power method to the matrix of Example 10.1 using
$v_0^T = (-.64966116, .74822116, 0)$, which is orthogonal to $x_1$. Therefore, since A is
symmetric, $v_0$ has no component in the $x_1$ direction.

Corresponding to the table in Example 10.1 we have

     m    v_m^T (normalized)             lambda
     1    (-.7154, -.7154, 1.0000)     -.1377753
     2    (-.6360, -.8068, 1.0000)     1.4634741
     3    (-.6369, -.8058, 1.0000)     1.4803108
     4    (-.6369, -.8058, 1.0000)     1.4801194
     5    (-.6369, -.8058, 1.0000)     1.4801216
     6    (-.6369, -.8058, 1.0000)     1.4801217
     7    (-.6369, -.8058, 1.0000)     1.4801219
    10    (-.6369, -.8058, 1.0000)     1.4801240
    15    (-.6368, -.8057, 1.0000)     1.4801606
    20    (-.6356, -.8044, 1.0000)     1.4807007
    30    (-.4008, -.5576, 1.0000)     1.5931941
    40    ( .7180,  .6179, 1.0000)     2.4976711
    50    ( .7481,  .6495, 1.0000)     2.5363412
    61    ( .7482,  .6497, 1.0000)     2.5365253

Initially the iteration converges to a value very nearly equal to $\lambda_2 = 1.4801215$, but
although $v_0$ is orthogonal to $x_1$, roundoff introduces a component in the $x_1$ direction and
slowly, but very slowly, the term in $x_1$ dominates, as in Example 10.1. The reason the
convergence to $\lambda_2$ is so good and so rapid is that, as we shall see in a later example, the
ratio $|\lambda_2/\lambda_3|$ is very large.


The drawbacks to this method are similar to those of Bernoulli’s method, 
in that special cases require special treatment. However, when the dominant 
eigenvalue is multiple but real, the method does converge. For if $\lambda_1$ has
multiplicity k,

$$v_m = A^m v_0 = \lambda_1^m \sum_{i=1}^{k} \alpha_i x_i + \sum_{i=k+1}^{n} \alpha_i \lambda_i^m x_i \tag{10.2-8}$$

and again the term in $\lambda_1$ dominates. Further, since $\sum_{i=1}^{k} \alpha_i x_i$ is an
eigenvector, the process converges to an eigenvector as before. A procedure to get the
other eigenvectors corresponding to $\lambda_1$ is considered in a problem {10}.


10.2-1 Acceleration of Convergence 


Because the power method may converge slowly, means of accelerating the 
convergence are clearly desirable. In this section we consider four such 
means. 


The $\delta^2$ process This application of the $\delta^2$ process is similar to our previous
applications. We assume that both $\lambda_1$ and $\lambda_2$ are real and that neither $-\lambda_1$
nor $-\lambda_2$ is an eigenvalue. Let $e_i$ be the ith column of the identity matrix I.
Then from (10.2-3) we calculate the Schwarz constant

$$e_i^T v_m = \sum_{j=1}^{n} a_j \lambda_j^m \tag{10.2-9}$$

where $a_j$ depends on $\alpha_j$ and the ith component of $x_j$. Then

$$\frac{e_i^T v_{m+1}}{e_i^T v_m} = \frac{a_1 \lambda_1^{m+1} + \sum_{j=2}^{n} a_j \lambda_j^{m+1}}{a_1 \lambda_1^m + \sum_{j=2}^{n} a_j \lambda_j^m} \tag{10.2-10}$$

Dividing numerator and denominator by $a_1 \lambda_1^m$ and expanding the denominator
in a power series, we get

$$\frac{e_i^T v_{m+1}}{e_i^T v_m} = \lambda_1 + B \left( \frac{\lambda_2}{\lambda_1} \right)^m + \left[ \text{terms in } \left( \frac{\lambda_3}{\lambda_1} \right)^m, \left( \frac{\lambda_2}{\lambda_1} \right)^{2m} \text{ and higher powers} \right] \tag{10.2-11}$$

where $B = (a_2/a_1)(\lambda_2 - \lambda_1)$. If we are near convergence, the terms in brackets
are small and we have

$$\lambda_1 \approx R_m - B r^m \tag{10.2-12}$$

with $R_m = e_i^T v_{m+1} / e_i^T v_m$ and $r = \lambda_2/\lambda_1$. Then proceeding as in Sec. 8.7-1, we
get as a better approximation to $\lambda_1$ {15}

$$\lambda_1 \approx R_{m+2} - \frac{(\Delta R_{m+1})^2}{\Delta^2 R_m} \tag{10.2-13}$$

If we apply (10.2-13) with m = 10 and i = 3 in (10.2-11) to the results in
Example 10.1, we get as our new approximation $\lambda_1 = 2.5365266$, which is
better than that achieved after 20 iterations.
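
The extrapolation (10.2-13) is a one-line computation. The sketch below (the
function name `aitken` is illustrative) applies it to the three $\lambda$ entries at
m = 10, 11, 12 of the table in Example 10.1. Treating those entries as
$R_m$, $R_{m+1}$, $R_{m+2}$ is an assumption made here because it reproduces the
value quoted in the text.

```python
def aitken(r0, r1, r2):
    """Aitken delta-squared extrapolation from three successive estimates.

    r0, r1, r2 play the roles of R_m, R_{m+1}, R_{m+2} in (10.2-13).
    """
    delta = r2 - r1               # forward difference of R_{m+1}
    delta2 = r2 - 2.0 * r1 + r0   # second difference of R_m
    return r2 - delta * delta / delta2


estimate = aitken(2.5380029, 2.5373873, 2.5370284)
print(round(estimate, 7))
```

The second difference in the denominator is a difference of nearly equal
numbers, so the extrapolation should only be applied while the successive
estimates still differ by well more than the rounding level.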


Wilkinson's method Suppose that all the eigenvalues are real but that the
power method is converging slowly because there are two eigenvalues nearly
equal in magnitude. Consider the matrix $A - pI$, which has eigenvalues
$\lambda_i - p$. By a judicious choice of p, it may be possible to speed up the
convergence markedly. The optimal choice of p if we wish to converge to $\lambda_1$ is that
value which minimizes

$$\max_{2 \le i \le n} \left| \frac{\lambda_i - p}{\lambda_1 - p} \right|$$

Hence, some knowledge of the eigenvalues of A is needed to apply this
technique. For example, if all the eigenvalues are positive, the optimal value
of p is $(\lambda_2 + \lambda_n)/2$ {16}, where $\lambda_n$ is the smallest eigenvalue of A. The choice
$p = \lambda_n$ will also yield an improved rate of convergence.

This method is an excellent example of one which, when used by an
experienced numerical analyst, can be very powerful; but if used haphazardly,
i.e., if the value of p is not chosen judiciously, it will be little better
than the power method. As an example of this method, if we use p = .75 in
Example 10.1, then, in place of the table in that example, we get

     m    v_m^T (normalized)           lambda
     5    (.7516, .6522, 1.0000)     1.791401
     6    (.7491, .6511, 1.0000)     1.7888443
     7    (.7488, .6501, 1.0000)     1.7873300
     8    (.7484, .6499, 1.0000)     1.7869152
     9    (.7483, .6497, 1.0000)     1.7866587
    10    (.7482, .6497, 1.0000)     1.7865914

which is a better result than after 15 iterations in Example 10.1. Convergence
to $\lambda_1 - p = 1.7865258$, the dominant eigenvalue of $A - pI$, was achieved in 19
iterations. The reason why p = .75 was a good choice will become apparent when
we calculate all the eigenvalues of the matrix A in Example 10.4.
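
The effect of the shift can be seen by evaluating the quantity being minimized
for a few values of p. The sketch below uses the three eigenvalues of the
matrix of Example 10.1 as they are listed in Example 10.4; the function name
`convergence_ratio` is illustrative. One eigenvalue of this matrix is
slightly negative, so applying the midpoint formula $(\lambda_2 + \lambda_n)/2$ here
is an assumption: it relies on $\lambda_1$ being algebraically the largest eigenvalue
rather than on all eigenvalues being positive.

```python
def convergence_ratio(eigenvalues, p):
    """Return max over i >= 2 of |lambda_i - p| / |lambda_1 - p|.

    eigenvalues[0] is the eigenvalue lambda_1 we want to converge to.
    """
    lam1 = eigenvalues[0]
    return max(abs(lam - p) for lam in eigenvalues[1:]) / abs(lam1 - p)


# Eigenvalues of the matrix of Example 10.1 (values from Example 10.4).
eigs = [2.5365258, 1.4801215, -0.0166473]

ratio_unshifted = convergence_ratio(eigs, 0.0)   # plain power method
ratio_shifted = convergence_ratio(eigs, 0.75)    # the shift used in the text
p_mid = (eigs[1] + eigs[2]) / 2.0                # midpoint of the other two
ratio_mid = convergence_ratio(eigs, p_mid)
print(ratio_unshifted, ratio_shifted, ratio_mid)
```

A smaller ratio means faster convergence, so comparing the three printed
numbers shows how much the shift p = .75 gains over the unshifted iteration
and how close it comes to the midpoint choice.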


The Rayleigh quotient If A is symmetric, then the eigenvectors are
orthogonal, and if we consider them to be orthonormal,

$$v_m^T A v_m = v_m^T v_{m+1} = \sum_{i=1}^{n} \alpha_i^2 \lambda_i^{2m+1} \tag{10.2-14}$$

and

$$v_m^T v_m = \sum_{i=1}^{n} \alpha_i^2 \lambda_i^{2m} \tag{10.2-15}$$

We can write

$$\frac{v_m^T A v_m}{v_m^T v_m} = \frac{v_m^T v_{m+1}}{v_m^T v_m} = \lambda_1 + O\left[ \left( \frac{\lambda_2}{\lambda_1} \right)^{2m} \right] \tag{10.2-16}$$

whereas for the power method itself, from (10.2-6),

$$\frac{y^T v_{m+1}}{y^T v_m} = \lambda_1 + O\left[ \left( \frac{\lambda_2}{\lambda_1} \right)^{m} \right] \tag{10.2-17}$$

Since the higher-order terms in (10.2-16) will usually be smaller than those in
(10.2-17), the former will generally give a better approximation to $\lambda_1$ than
the power method itself. For example, consider Example 10.1 with m = 11.
The unrounded values of $v_{11}^T$ and $v_{12}^T$ are, respectively, (.74888011, .65035358,
1.0) and (.74860561, .65006512, 1.0). For these vectors the left-hand side of
(10.2-16) is 2.5365256, which is a slightly better result than that achieved
using the $\delta^2$ process, which also used data through the twelfth iteration. It is
generally true that this technique will give better results than the $\delta^2$ process
{17}. Therefore, for symmetric matrices, it is to be preferred. The quotient in
(10.2-16) is the Rayleigh quotient and bears a close resemblance to the
quotient in Theorem 10.6.

Matrix powers If two eigenvalues are nearly equal in magnitude, then in
order to separate the eigenvalues we can compute $A^2, A^4, A^8, \ldots$. This
technique is directly analogous to the root-squaring procedure of Sec. 8.10-2
but is, of course, very inefficient because the computation of each power of A
requires $n^3$ operations. It is therefore not recommended.
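
The Rayleigh quotient of (10.2-16) is easy to evaluate directly. The sketch
below computes it for the unrounded iterate $v_{11}$ quoted above; the helper
names are illustrative and the arithmetic is double precision.

```python
def mat_vec(a, v):
    """Return the product of the matrix a (a list of rows) and the vector v."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def dot(u, v):
    """Return the inner product of the vectors u and v."""
    return sum(x * y for x, y in zip(u, v))


def rayleigh_quotient(a, v):
    """Return v^T A v / v^T v, the left-hand side of (10.2-16)."""
    return dot(v, mat_vec(a, v)) / dot(v, v)


A = [[1.0, 1.0, 0.5],
     [1.0, 1.0, 0.25],
     [0.5, 0.25, 2.0]]

v11 = [0.74888011, 0.65035358, 1.0]   # unrounded 11th iterate from the text
rq = rayleigh_quotient(A, v11)
print(round(rq, 7))
```

Because the quotient is unchanged when v is multiplied by a scalar, it does
not matter that the iterate has been normalized to have largest component 1.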


10.2-2 The Inverse Power Method 


If A is a nonsingular matrix with eigenvalues $\lambda_1, \ldots, \lambda_n$, then the inverse
matrix $A^{-1}$ has the eigenvalues $\lambda_1^{-1}, \ldots, \lambda_n^{-1}$. Hence, if $A - pI$ is
nonsingular, the eigenvalues of $(A - pI)^{-1}$ are $(\lambda_1 - p)^{-1}, \ldots, (\lambda_n - p)^{-1}$. If we now
apply the power method with $(A - pI)^{-1}$, we find that, starting as before
with $v_0$ satisfying (10.2-1), the mth iterated vector $v_m$ is given by

$$v_m = (A - pI)^{-m} v_0 = \sum_{i=1}^{n} \alpha_i (\lambda_i - p)^{-m} x_i \tag{10.2-18}$$

If now p is close to one of the distinct eigenvalues $\lambda_j$, which may be a
multiple eigenvalue but which for simplicity we assume to be of multiplicity
1, and if $\lambda_j$ is separated from all the other eigenvalues so that $|p - \lambda_j| <
|p - \lambda_i|$, $i \ne j$, then, provided that $\alpha_j \ne 0$, the term $\alpha_j (\lambda_j - p)^{-m} x_j$ will
dominate the sum on the right-hand side of (10.2-18) for sufficiently large m. The
vector $v_m$ will then be a very good approximation to the eigenvector
corresponding to $\lambda_j$. This dominance shows up quite early if p is very close to $\lambda_j$;
in this case, even $v_2$ will be effectively an eigenvector. This is so even if $v_0$ has
a relatively small component in the $x_j$ direction and $\lambda_j$ is not particularly well
separated from the other eigenvalues. For example, if $\lambda_j - p = 10^{-10}$,
$|\lambda_k - p| > 10^{-3}$, $k \ne j$, and $\alpha_j = 10^{-4}$, then {20}

$$10^{-16} v_2 = x_j + y \tag{10.2-19}$$

where $\|y\| < 10^{-10}$.

The inverse power method is applied in two different contexts. The first 
is in an iterative method to find eigenvalues and eigenvectors using a gener- 
alized version of the Rayleigh iteration given in Sec. 10.2-1. We give here the 
application of this method to symmetric matrices and treat the general case
in a problem {21}. Starting with an initial approximation $v_0$ to an
eigenvector, we compute the Rayleigh quotient

$$\mu_1 = \frac{v_0^T A v_0}{v_0^T v_0} \tag{10.2-20}$$

followed by

$$v_1 = k_1 (A - \mu_1 I)^{-1} v_0 \tag{10.2-21}$$

where $k_1$ is a normalizing factor chosen so that $\|v_1\| = 1$ for one of the
standard norms and $\mu_1$ plays the role of p in (10.2-18). In general, we
compute

$$\mu_{i+1} = \frac{v_i^T A v_i}{v_i^T v_i} \tag{10.2-22}$$

$$v_{i+1} = k_{i+1} (A - \mu_{i+1} I)^{-1} v_i \tag{10.2-23}$$

Under appropriate hypotheses, it can be shown that the sequence $\{\mu_i, v_i\}$
converges cubically to an eigenvalue-eigenvector pair $\{\lambda, v\}$ which depends
on the initial approximation $v_0$.

In practice, the inverse $(A - \mu_{i+1} I)^{-1}$ is not computed explicitly.
Instead, the set of linear equations

$$(A - \mu_{i+1} I) z_{i+1} = v_i \tag{10.2-24}$$

is solved, and then we use $v_{i+1} = k_{i+1} z_{i+1}$. Even though the coefficient
matrix $A - \mu_i I$ approaches singularity as $\mu_i$ approaches an eigenvalue, this
does not detract from the accuracy of the calculation provided that partial
pivoting is used to solve (10.2-24) (see Sec. 9.3). We mention at this point
that the generalized Rayleigh iteration as applied to nonsymmetric matrices
is intimately related to the Jenkins-Traub method for finding the zeros of a
polynomial (see Sec. 8.11).
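
The iteration (10.2-22) to (10.2-24) can be sketched as follows for the
symmetric matrix of Example 10.1. This is an illustration under stated
assumptions rather than a production routine: the names `solve`,
`rayleigh_quotient`, and `rayleigh_iteration` are invented here, the Euclidean
norm is used for normalization, and only two steps are taken so that the
shifted matrix never becomes singular to working precision. A robust code
would also guard against a zero pivot.

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]   # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                       # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def rayleigh_quotient(a, v):
    av = [dot(row, v) for row in a]
    return dot(v, av) / dot(v, v)


def rayleigh_iteration(a, v0, steps):
    """Return the Rayleigh quotient and vector after the given number of steps."""
    n = len(v0)
    v = list(v0)
    for _ in range(steps):
        mu = rayleigh_quotient(a, v)                     # (10.2-22)
        shifted = [[a[i][j] - (mu if i == j else 0.0) for j in range(n)]
                   for i in range(n)]
        z = solve(shifted, v)                            # (10.2-24)
        norm = math.sqrt(dot(z, z))
        v = [z_i / norm for z_i in z]                    # v = k z, ||v|| = 1
    return rayleigh_quotient(a, v), v


A = [[1.0, 1.0, 0.5],
     [1.0, 1.0, 0.25],
     [0.5, 0.25, 2.0]]

mu, v = rayleigh_iteration(A, [1.0, 1.0, 1.0], 2)
print(mu, [v_i / v[2] for v_i in v])
```

Starting from $v_0^T = (1, 1, 1)$ the first Rayleigh quotient is 2.5, so the
iteration is attracted to the dominant eigenvalue of Example 10.1; a different
starting vector may lead to a different eigenpair, as the text notes.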

The second context in which the inverse power method is used is to find
eigenvectors corresponding to computed eigenvalues by the method of inverse
iteration, which uses (10.2-18) with a fixed value of p corresponding to
an accurate approximation to an eigenvalue. Of course, if the eigenvalue has
been computed by the power method, we have the corresponding eigenvector
as well. However, there are many methods studied later in this chapter
which compute only eigenvalues and not eigenvectors. Combining the inverse
power method with one of the better methods for computing
eigenvalues given in succeeding sections yields one of the most efficient ways
to compute the eigensystem of a matrix. It is especially efficient if A has been
reduced to tridiagonal or Hessenberg form (see Secs. 10.3 and 10.4), for then
the amount of work per eigenvector is reduced from $O(n^3)$ operations to
$O(n)$ or $O(n^2)$ operations, respectively (see Sec. 9.11). (The same reduction in
work per iteration for these special matrices occurs also in the Rayleigh
quotient method.)

In inverse iteration, we assume that we have computed an accurate
approximation $\tilde{\lambda}$ to an eigenvalue $\lambda$ and wish to compute the corresponding
eigenvector x. Even if $\tilde{\lambda}$ differs from $\lambda$ only by rounding error, the matrix
$A - \tilde{\lambda} I$ is still nonsingular, so that, as we have seen in Sec. 9.3 [cf. (9.3-23)],
we can compute a triangular decomposition of some permutation of $A - \tilde{\lambda} I$

$$P(A - \tilde{\lambda} I) = LU \tag{10.2-25}$$

in $O(n)$, $O(n^2)$, or $O(n^3)$ operations, depending on whether A is tridiagonal,
Hessenberg, or general, respectively. We now wish to solve

$$(A - \tilde{\lambda} I) z_1 = P^{-1} L U z_1 = v_0 \tag{10.2-26}$$

for some initial vector $v_0$ which does not have an excessively small
component in the direction x. In the absence of any additional information about x,
this may be quite difficult to ensure. One way which has been very successful
in practice, at the same time saving some computing effort, is to choose $v_0$ so
that

$$L^{-1} P v_0 = u = [1, 1, \ldots, 1]^T \tag{10.2-27}$$

Hence, we start our iteration by solving $U z_1 = u$. After normalizing $z_1$, as
above, we apply one more iteration involving a forward and back
substitution and accept $z_2$ as our eigenvector x, which we can normalize in any way
desired. Except in the case of pathologically close roots, $z_2$ is a very accurate
approximation to x.

One apparent drawback to this method is that if $\tilde{\lambda}$ is a multiple
eigenvalue, we shall be able to compute only one corresponding eigenvector.
However, we can usually compute a different eigenvector if we start with a
perturbation of $\tilde{\lambda}$ by as little as three units in the last significant digit. This
may affect the speed of convergence in that more than two iterations may be
necessary, but the accuracy of the computed eigenvector will not be affected.
Alternatively, we may choose $v_0$ to be a vector whose components are
(pseudo-) random numbers between $-1$ and 1. Then, if $\tilde{\lambda}$ is a multiple
eigenvalue, different choices of $v_0$ will generally yield linearly independent
eigenvectors.


Example 10.3 Apply inverse iteration to find the eigenvector x corresponding to the
computed eigenvalue $\tilde{\lambda} = -256.01$ of the matrix

$$A = \begin{bmatrix} -120.0 & 90.86 & 0 & 0 \\ 90.86 & -157.2 & -67.59 & 0 \\ 0 & -67.59 & -124.0 & 46.26 \\ 0 & 0 & 46.26 & -78.84 \end{bmatrix}$$

using 5-digit floating-point decimal arithmetic.

Since $A - \tilde{\lambda} I$ is the matrix of Example 9.3, the triangular factorization of $A - \tilde{\lambda} I$ is
given in that example [cf. (10.2-25)] with

$$L = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ .66804 & -.56387 & .14800 & 1 \end{bmatrix} \qquad U = \begin{bmatrix} 136.01 & 90.860 & 0 & 0 \\ 0 & -67.590 & 132.01 & 46.260 \\ 0 & 0 & 46.260 & 177.17 \\ 0 & 0 & 0 & -.13653 \end{bmatrix}$$

$$P = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \end{bmatrix}$$

Setting $u = (1, 1, 1, 1)^T$, we solve $U z_1 = u$ by back substitution and find

$$z_1 = [-33.254 \quad 49.790 \quad 28.067 \quad -7.3244]^T$$

We then solve the system $L U z_2 = P z_1$ as in Example 9.7 to get the final solution

$$z_2 = [-2953.3 \quad 4420.5 \quad 2491.5 \quad -650.59]^T$$

The value $\tilde{\lambda}$ is an approximation to the true eigenvalue of A, $\lambda = -256.00$. The
eigenvector corresponding to $\lambda$ is

$$x = [-.66810 \quad 1.0000 \quad .56363 \quad -.14718]^T$$

A similar normalization of $z_2$ yields the vector

$$\tilde{x} = [-.66809 \quad 1.0000 \quad .56362 \quad -.14718]^T$$

which agrees with x to within rounding error.
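
Inverse iteration on the matrix of Example 10.3 can be sketched as below. The
sketch departs from the text in three labeled ways: it starts from
$v_0 = (1, 1, 1, 1)^T$ instead of choosing $v_0$ through (10.2-27), it re-solves
the full system at each step instead of reusing a stored factorization, and
it works in double precision rather than 5-digit decimal arithmetic. The
names `solve` and `inverse_iteration` are illustrative.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]   # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                       # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def inverse_iteration(a, shift, v0, steps):
    """Apply the power method to (A - shift*I)^(-1) without forming the inverse."""
    n = len(v0)
    shifted = [[a[i][j] - (shift if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    v = list(v0)
    for _ in range(steps):
        z = solve(shifted, v)
        big = max(z, key=abs)            # scale so the largest component is 1
        v = [z_i / big for z_i in z]
    return v


A = [[-120.0, 90.86, 0.0, 0.0],
     [90.86, -157.2, -67.59, 0.0],
     [0.0, -67.59, -124.0, 46.26],
     [0.0, 0.0, 46.26, -78.84]]

x = inverse_iteration(A, -256.01, [1.0, 1.0, 1.0, 1.0], 2)
print([round(x_i, 5) for x_i in x])
```

Two steps suffice here because the shift lies within about .01 of the
eigenvalue while, by the sum of squares of the elements of A, every other
eigenvalue is far away; each step therefore reduces the unwanted components
by several orders of magnitude.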


10.3 THE EIGENVALUES AND EIGENVECTORS 
OF SYMMETRIC MATRICES 


Our object here is to develop methods which will enable us to compute all 
the eigenvalues and eigenvectors when the matrix A is symmetric, as we shall 
assume it is throughout this section. The three methods we shall consider all 
use as their basic tool orthogonal transformations of A. In particular the first 
of these three methods has its theoretical basis in Corollary 10.2 of 
Sec. 10.1-4. 


10.3-1 The Jacobi Method 
Corollary 10.2 assures us that there exists an orthogonal matrix Q such that

$$Q^T A Q = D \tag{10.3-1}$$

where D is a diagonal matrix with the eigenvalues of A on the diagonal.
Our technique will be to find a sequence $\{S_k\}$ of orthogonal matrices with
the property that

$$\lim_{k \to \infty} S_1 S_2 \cdots S_k = Q \tag{10.3-2}$$

We shall use the notation

$$T_k = S_k^T S_{k-1}^T \cdots S_1^T A S_1 S_2 \cdots S_k \qquad T_0 = A \tag{10.3-3}$$

We denote the elements of $T_k$ by $t_{ij}^{(k)}$ and of $S_k$ by $s_{ij}^{(k)}$. We define

$$v_k = \sum_{\substack{i,j=1 \\ i \ne j}}^{n} \left( t_{ij}^{(k)} \right)^2 \qquad k = 0, 1, \ldots \tag{10.3-4}$$

and

$$w_k = \sum_{i,j=1}^{n} \left( t_{ij}^{(k)} \right)^2 \qquad k = 0, 1, \ldots \tag{10.3-5}$$

Thus $w_k$ is the square of the Euclidean norm of $T_k$, and $v_k$ is the sum of the
squares of the off-diagonal elements of $T_k$. Our object is to choose the
sequence $\{S_k\}$ so that

$$w_{k+1} = w_k \qquad v_{k+1} < v_k \qquad \text{for all } k \tag{10.3-6}$$

and

$$\lim_{k \to \infty} v_k = 0 \tag{10.3-7}$$

in which case

$$\lim_{k \to \infty} T_k = D \tag{10.3-8}$$


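Let $t_{pq}^{(k-1)}$ be a nonzero off-diagonal element of $T_{k-1}$. We wish to choose
$S_k$ so that $t_{pq}^{(k)} = 0$. In doing so we shall show that (10.3-6) is satisfied. Let

$$\begin{aligned} s_{pp}^{(k)} &= s_{qq}^{(k)} = \cos \theta_k & s_{ii}^{(k)} &= 1 \quad i \ne p \text{ or } q \\ s_{pq}^{(k)} &= -s_{qp}^{(k)} = \sin \theta_k & s_{ij}^{(k)} &= 0 \quad \text{otherwise} \end{aligned} \tag{10.3-9}$$

where $\theta_k$ will be chosen below so that the pq element is annihilated; that is,
$t_{pq}^{(k)} = 0$. The orthogonal matrix defined by (10.3-9) is called a plane rotation
matrix because the linear transformation defined by $S_k$ consists of a rotation
of the axes of the pth and qth coordinates through an angle $\theta_k$. From
(10.3-3)

$$T_k = S_k^T T_{k-1} S_k \tag{10.3-10}$$

so that we have, using (10.3-9),

$$\begin{aligned} t_{pj}^{(k)} &= t_{pj}^{(k-1)} \cos \theta_k - t_{qj}^{(k-1)} \sin \theta_k \\ t_{qj}^{(k)} &= t_{pj}^{(k-1)} \sin \theta_k + t_{qj}^{(k-1)} \cos \theta_k \end{aligned} \qquad j \ne p \text{ or } q \tag{10.3-11}$$

$$\begin{aligned} t_{ip}^{(k)} &= t_{ip}^{(k-1)} \cos \theta_k - t_{iq}^{(k-1)} \sin \theta_k \\ t_{iq}^{(k)} &= t_{ip}^{(k-1)} \sin \theta_k + t_{iq}^{(k-1)} \cos \theta_k \end{aligned} \qquad i \ne p \text{ or } q \tag{10.3-12}$$

$$\begin{aligned} t_{pp}^{(k)} &= t_{pp}^{(k-1)} \cos^2 \theta_k + t_{qq}^{(k-1)} \sin^2 \theta_k - 2 t_{pq}^{(k-1)} \sin \theta_k \cos \theta_k \\ t_{qq}^{(k)} &= t_{pp}^{(k-1)} \sin^2 \theta_k + t_{qq}^{(k-1)} \cos^2 \theta_k + 2 t_{pq}^{(k-1)} \sin \theta_k \cos \theta_k \\ t_{pq}^{(k)} &= t_{qp}^{(k)} = \tfrac{1}{2} \left( t_{pp}^{(k-1)} - t_{qq}^{(k-1)} \right) \sin 2\theta_k + t_{pq}^{(k-1)} \cos 2\theta_k \end{aligned} \tag{10.3-13}$$

$$t_{ij}^{(k)} = t_{ij}^{(k-1)} \qquad i, j \ne p \text{ or } q \tag{10.3-14}$$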


Now we shall choose $\theta_k$ to make $t_{pq}^{(k)}$ vanish. From the last equation of
(10.3-13) we have

$$\lambda = \cot 2\theta_k = \frac{t_{qq}^{(k-1)} - t_{pp}^{(k-1)}}{2 t_{pq}^{(k-1)}} \tag{10.3-15}$$

so that a $\theta_k$ always exists. In practice we do not calculate $\theta_k$ itself but only
$\sin \theta_k$ and $\cos \theta_k$ since they are all that are required by (10.3-11) to (10.3-13).
A convenient way to calculate $\sin \theta_k$ and $\cos \theta_k$ is as follows. Let $t = \tan \theta_k$.
Then using the identity $\tan^2 \theta + 2 \tan \theta \cot 2\theta - 1 = 0$, choose $t$ to be the
smaller root in magnitude of the equation

$$t^2 + 2 \lambda t - 1 = 0 \tag{10.3-16}$$

Thus using the form of the quadratic equation formula with the discriminant
in the denominator, we have

$$t = \frac{\operatorname{sign} (\lambda)}{|\lambda| + \sqrt{1 + \lambda^2}} \tag{10.3-17}$$

Then†

$$C = \cos \theta_k = \frac{1}{\sqrt{1 + t^2}} \qquad S = \sin \theta_k = tC \tag{10.3-18}$$


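Equations (10.3-11) to (10.3-13) can be reformulated {22} to yield the
following set of equations which give more accurate results in computation
by expressing all elements of $T_k$ as perturbations of elements of $T_{k-1}$:

$$\begin{aligned} t_{pp}^{(k)} &= t_{pp}^{(k-1)} - h \qquad t_{qq}^{(k)} = t_{qq}^{(k-1)} + h \qquad t_{pq}^{(k)} = t_{qp}^{(k)} = 0 \\ t_{ip}^{(k)} &= t_{pi}^{(k)} = t_{ip}^{(k-1)} - S \left( t_{iq}^{(k-1)} + \tau t_{ip}^{(k-1)} \right) \\ t_{iq}^{(k)} &= t_{qi}^{(k)} = t_{iq}^{(k-1)} + S \left( t_{ip}^{(k-1)} - \tau t_{iq}^{(k-1)} \right) \end{aligned} \tag{10.3-19}$$

where

$$\tau = \tan \left( \tfrac{1}{2} \theta_k \right) = \frac{S}{1 + C} \qquad \text{and} \qquad h = t \, t_{pq}^{(k-1)} \tag{10.3-20}$$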


Using (10.3-11) to (10.3-13), we can easily calculate that, independent of
$\theta_k$,

$$\begin{aligned} \left( t_{pj}^{(k)} \right)^2 + \left( t_{qj}^{(k)} \right)^2 &= \left( t_{pj}^{(k-1)} \right)^2 + \left( t_{qj}^{(k-1)} \right)^2 \qquad j \ne p \text{ or } q \\ \left( t_{ip}^{(k)} \right)^2 + \left( t_{iq}^{(k)} \right)^2 &= \left( t_{ip}^{(k-1)} \right)^2 + \left( t_{iq}^{(k-1)} \right)^2 \qquad i \ne p \text{ or } q \end{aligned} \tag{10.3-21}$$

and

$$\left( t_{pp}^{(k)} \right)^2 + \left( t_{qq}^{(k)} \right)^2 + \left( t_{pq}^{(k)} \right)^2 + \left( t_{qp}^{(k)} \right)^2 = \left( t_{pp}^{(k-1)} \right)^2 + \left( t_{qq}^{(k-1)} \right)^2 + \left( t_{pq}^{(k-1)} \right)^2 + \left( t_{qp}^{(k-1)} \right)^2 \tag{10.3-22}$$

But with $\theta_k$ chosen as in (10.3-15), $t_{pq}^{(k)} = t_{qp}^{(k)} = 0$. Therefore, (10.3-21) and
(10.3-22) together with (10.3-14) show that $w_{k-1} = w_k$ and (10.3-22) implies
that $v_k < v_{k-1}$. In fact, (10.3-22) implies that since $t_{pq}^{(k-1)} = t_{qp}^{(k-1)}$, because all
the $T_k$ are symmetric, the off-diagonal sum of squares is reduced by $2 ( t_{pq}^{(k-1)} )^2$
and the sum of squares of the diagonal elements is therefore increased by a
like amount. At each stage of the Jacobi iteration we (1) choose a nonzero
off-diagonal element; (2) calculate $\sin \theta_k$ and $\cos \theta_k$ from (10.3-18); (3) calculate
those elements in $T_k$ which differ from those in $T_{k-1}$ using (10.3-19).

† Note that (10.3-17) and (10.3-18) always result in a rotation angle such that $|\theta_k| \le \pi/4$
(why?).

Note that an off-diagonal element made zero at one stage will generally
become nonzero at some later stage (why?).
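
The three steps of a Jacobi stage can be sketched as follows. This is an
illustration under stated assumptions, not the reformulated scheme (10.3-19):
it applies the rotation as a column update followed by a row update, which is
the content of (10.3-11) to (10.3-13); it always annihilates the off-diagonal
element of greatest magnitude; it stops when that element falls below a
tolerance instead of testing the off-diagonal norm; and it takes sign(0) = +1.
The name `jacobi_eigen` and its parameters are invented here.

```python
import math


def jacobi_eigen(a, tol=1e-12, max_rotations=100):
    """Classical Jacobi method for a symmetric matrix given as a list of rows.

    Returns the diagonal of the final T_k (eigenvalue estimates) and the
    accumulated product of rotations (columns estimate the eigenvectors).
    """
    n = len(a)
    t = [list(row) for row in a]
    r = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(max_rotations):
        # (1) choose the off-diagonal element of greatest magnitude
        p, q = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                   key=lambda ij: abs(t[ij[0]][ij[1]]))
        if abs(t[p][q]) < tol:
            break
        # (2) rotation parameters
        lam = (t[q][q] - t[p][p]) / (2.0 * t[p][q])        # cot(2 theta)
        sign = 1.0 if lam >= 0.0 else -1.0
        tan = sign / (abs(lam) + math.sqrt(1.0 + lam * lam))
        c = 1.0 / math.sqrt(1.0 + tan * tan)
        s = tan * c
        # (3) T <- T S and R <- R S: only columns p and q change
        for i in range(n):
            tip, tiq = t[i][p], t[i][q]
            t[i][p] = c * tip - s * tiq
            t[i][q] = s * tip + c * tiq
            rip, riq = r[i][p], r[i][q]
            r[i][p] = c * rip - s * riq
            r[i][q] = s * rip + c * riq
        # T <- S^T T: only rows p and q change
        for j in range(n):
            tpj, tqj = t[p][j], t[q][j]
            t[p][j] = c * tpj - s * tqj
            t[q][j] = s * tpj + c * tqj
    return [t[i][i] for i in range(n)], r


A = [[1.0, 1.0, 0.5],
     [1.0, 1.0, 0.25],
     [0.5, 0.25, 2.0]]

eigenvalues, vectors = jacobi_eigen(A)
print(sorted(eigenvalues))
```

The order in which the eigenvalues appear on the diagonal depends on the
rotations chosen, so the result is sorted before printing.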

This process can be continued for as long as is desired. The convergence 
of the method can be proved for a variety of techniques, including those 
considered below, for choosing the nonzero off-diagonal element at each 
stage. For some techniques the proof is quite difficult, but in the case where 
the off-diagonal element of greatest magnitude is annihilated at each stage, 
the proof is quite straightforward {23}. The most common convergence
criterion used in practice is to require that the off-diagonal norm $\sqrt{v_k}$ be less
than some preset tolerance (cf. {24}).

The eigenvectors of A are also readily calculated using this iteration. We
noted in Sec. 10.1-4 [cf. (10.1-38)] that the columns of the orthogonal matrix
used to reduce A to diagonal form are the eigenvectors of A. To calculate
the eigenvectors we must then calculate the product of the $S_k$ matrices in
(10.3-2). To see how this can be done simply we write (10.3-3) as

$$T_k = R_k^T A R_k \tag{10.3-23}$$

with

$$R_k = S_1 S_2 \cdots S_k \tag{10.3-24}$$

Thus

$$R_{k+1} = R_k S_{k+1} \tag{10.3-25}$$

and from (10.3-9), we can then calculate the elements of $R_{k+1} = [r_{ij}^{(k+1)}]$ in
terms of those of $R_k = [r_{ij}^{(k)}]$:

$$\begin{aligned} r_{ip}^{(k+1)} &= r_{ip}^{(k)} \cos \theta_{k+1} - r_{iq}^{(k)} \sin \theta_{k+1} \\ r_{iq}^{(k+1)} &= r_{ip}^{(k)} \sin \theta_{k+1} + r_{iq}^{(k)} \cos \theta_{k+1} \\ r_{ij}^{(k+1)} &= r_{ij}^{(k)} \qquad j \ne p \text{ or } q \end{aligned} \tag{10.3-26}$$

starting with $R_0 = I$.

The one problem remaining in the use of the Jacobi method is how to
choose the off-diagonal element to be annihilated at each stage. We should
like to choose that element of greatest magnitude since this would result in
the greatest reduction of the off-diagonal norm $\sqrt{v_k}$ at every stage. For hand
computation this is easily done, but on a digital computer this requires
a search through all the off-diagonal elements (actually just through half of
them since the matrix is symmetric at every stage). Thus, it is more
convenient to annihilate all the subdiagonal elements in serial order, starting with
the (2, 1) element, proceeding down the first column, continuing with the
(3, 2) element, etc. This is called the serial Jacobi method and can be shown to
converge provided the rotation angle $\theta_k$ satisfies $|\theta_k| \le \pi/4$, as indeed it
does in our formulation. However, since the annihilation of a small
subdiagonal element may not be worth the computational effort invested, a
procedure called the threshold Jacobi method may be used. The essence of this
procedure is to search through the off-diagonal elements until an element of
magnitude greater than some threshold value is found. This element is then
annihilated. This method assures some minimum reduction of the
off-diagonal norm at each stage. The details of this method, in particular how to
choose the threshold at every stage, are considered in a problem {24}.


Example 10.4 Apply the Jacobi method to the matrix A of Example 10.1.
The largest off-diagonal element is 1.0 in the first row and second column (p = 1,
q = 2). Using (10.3-17) and (10.3-18), we have

$$\sin \theta_1 = -\frac{1}{\sqrt{2}} \qquad \text{and} \qquad \cos \theta_1 = \frac{1}{\sqrt{2}}$$

Using (10.3-19) we compute

$$T_1 = S_1^T A S_1 = \begin{bmatrix} 2 & 0 & \dfrac{3}{4\sqrt{2}} \\ 0 & 0 & \dfrac{-1}{4\sqrt{2}} \\ \dfrac{3}{4\sqrt{2}} & \dfrac{-1}{4\sqrt{2}} & 2 \end{bmatrix}$$

By continuing this process we obtain

$$D = \begin{bmatrix} 2.5365258 & 0 & 0 \\ 0 & -.0166473 & 0 \\ 0 & 0 & 1.4801215 \end{bmatrix}$$

and, using (10.3-26),

$$Q = \begin{bmatrix} .53148338 & -.72120712 & -.44428106 \\ .46147338 & .68634928 & -.56210938 \\ .71032933 & .09372796 & .69760117 \end{bmatrix}$$

As a check on the results we can compute $QDQ^T$, which should equal A.
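
The check suggested at the end of Example 10.4 can be carried out as in the
sketch below, which forms $QDQ^T$ from the 8-digit values of Q and D given in
the example and reports the largest deviation from A. The function name
`reconstruct` is illustrative.

```python
Q = [[0.53148338, -0.72120712, -0.44428106],
     [0.46147338, 0.68634928, -0.56210938],
     [0.71032933, 0.09372796, 0.69760117]]
D = [2.5365258, -0.0166473, 1.4801215]   # diagonal of D
A = [[1.0, 1.0, 0.5],
     [1.0, 1.0, 0.25],
     [0.5, 0.25, 2.0]]


def reconstruct(q, d):
    """Return Q D Q^T, where d holds the diagonal elements of D."""
    n = len(d)
    return [[sum(q[i][k] * d[k] * q[j][k] for k in range(n))
             for j in range(n)] for i in range(n)]


B = reconstruct(Q, D)
max_error = max(abs(B[i][j] - A[i][j]) for i in range(3) for j in range(3))
print(max_error)
```

Since Q and D are given to only eight figures, the deviation cannot be
expected to vanish; it should be of the size of the rounding in the data.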


The accuracy of the results of the Jacobi method depends upon how
accurately the square roots leading to $\sin \theta_k$ and $\cos \theta_k$ given by (10.3-18) are
calculated and on how the roundoff error accumulates. The error analysis of
the Jacobi method is quite involved and beyond us here [see Goldstine,
Murray, and Von Neumann (1959) and Wilkinson (1962)], but if the square
roots are calculated with appropriate accuracy, the method is completely
stable with respect to roundoff error; i.e., no significant growth of error
occurs because of roundoff.

The Jacobi method has been widely applied on digital computers to find
the eigenvalues of symmetric matrices, some of very high order. Using a
relatively simple, compact program, it yields all the eigenvalues and, if
desired, all the eigenvectors. Furthermore, the computed eigenvectors are
always orthogonal, up to roundoff error, even when there are multiple
eigenvalues. Its major disadvantage is that it is an infinite iterative method,
and when high accuracy is desired in the eigenvalues or the diagonal elements
of A are not large compared with the off-diagonal elements, it may lead
to a lengthy computation. In the next two sections, we consider two methods,
both of which are finite iterative processes, which lead not to the matrix D
but instead to a new matrix whose eigenvalues are much easier to compute
than those of a general symmetric matrix A.


10.3-2 Givens’ Method 


The basis of this method is to use orthogonal matrices not to diagonalize A 
but rather to tridiagonalize A, that is, reduce A to a form in which the only 
nonzero elements are on the main diagonal and the two diagonals directly 
above and below it, as shown in Fig. 10.1. We shall then consider how to 
find the eigenvalues of such a matrix. 

$$\begin{bmatrix} a_{11} & a_{12} & & & \\ a_{12} & a_{22} & a_{23} & & \\ & a_{23} & \ddots & \ddots & \\ & & \ddots & \ddots & a_{n-1,n} \\ & & & a_{n-1,n} & a_{nn} \end{bmatrix}$$

Figure 10.1 A symmetric tridiagonal matrix.

The orthogonal matrices $S_k$ used in the Jacobi method had the property
that only the pth and qth rows and columns of $T_{k-1}$ were changed in
calculating $T_k = S_k^T T_{k-1} S_k$. If then we arrange the order of the calculations
carefully, we should not be surprised to find that (up to a point) we can
annihilate off-diagonal elements while at the same time keeping previously
annihilated elements zero. First we note that using $S_k$ as defined by (10.3-9)
(with $\theta_k$ unspecified as yet) we can annihilate, instead of $t_{pq}^{(k-1)}$, $t_{rq}^{(k-1)}$ with
$r \ne p$ or q (and by symmetry $t_{qr}^{(k-1)}$ is annihilated also). This follows from
(10.3-12) by writing

$$t_{rq}^{(k)} = 0 = t_{rp}^{(k-1)} \sin \theta_k + t_{rq}^{(k-1)} \cos \theta_k \tag{10.3-27}$$

which will be satisfied if

$$\sin \theta_k = -\alpha t_{rq}^{(k-1)} \qquad \cos \theta_k = \alpha t_{rp}^{(k-1)} \tag{10.3-28}$$

with

$$\alpha = 1 \Big/ \left[ \left( t_{rp}^{(k-1)} \right)^2 + \left( t_{rq}^{(k-1)} \right)^2 \right]^{1/2} \tag{10.3-29}$$

Let us denote the matrix whose elements are given by (10.3-9), with
$\sin \theta_k$ and $\cos \theta_k$ given by (10.3-28) and (10.3-29), by the triplet (p, q, r), these
three distinct integers denoting the row (and column) indices of significance
in the transformation. In particular consider the sequence of
transformations

$$(p, q, r) = (2, i, 1) \qquad i = 3, \ldots, n \tag{10.3-30}$$

applied in succession to the original matrix A. That is, each triplet defines a
transformation which we apply by premultiplying the current matrix by the
transpose of the matrix defined by (10.3-9) and postmultiplying by the
matrix itself. Denoting this matrix now by $S_{pqr}$, we have after the first step

$$S_{231}^T A S_{231} \tag{10.3-31}$$


Thus the transformation (2, 3, 1) annihilates the element in the first row 
(column) and third column (row) with p and q in (10.3-9) being 2 and 3, 
respectively. In general the (2, i, 1) transformation annihilates the element in 
the first row (column) and ith column (row). Of equal significance, however, 
is the fact that each annihilated element remains zero. This is because the 
(2, i, 1) transformation changes only the second row (column) and the ith 
column (row) and therefore does not affect previously annihilated elements. 
Therefore, after the sequence of transformations (10.3-30), all elements in the
first row (column) except the first two ($a_{11}$ and $a_{12}$) are zero.
Next we consider the sequence 


$$(p, q, r) = (3, i, 2) \qquad i = 4, \ldots, n \tag{10.3-32}$$


which annihilates the elements of the second row (column) from $a_{24}$ to $a_{2n}$.
By reasoning similar to that in the previous paragraph and use of (10.3-32) 
we can show that {26} the sequence of transformations (10.3-32) leaves all 
previously annihilated elements in the first and second rows (columns) zero. 
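The general algorithm is now clear. We apply the sequence of
transformations

$$(p, q, r) = (j, i, j-1) \qquad i = j+1, \ldots, n; \quad j = 2, \ldots, n-1 \tag{10.3-33}$$

and in so doing get a new symmetric matrix B with the form

$$B = \begin{pmatrix}
b_1 & c_1 & & & \\
c_1 & b_2 & c_2 & & \\
& c_2 & \ddots & \ddots & \\
& & \ddots & \ddots & c_{n-1} \\
& & & c_{n-1} & b_n
\end{pmatrix} \tag{10.3-34}$$
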

From (10.3-33) the number of transformations required is (n - 2)(n - 1)/2,
and we may show that the total number of operations required is of the
order of $\tfrac{4}{3}n^3$ plus, of course, the (n - 2)(n - 1)/2 square roots required by
(10.3-29) {31}.
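
The reduction just described is short enough to state as a program. What follows is a minimal sketch rather than a definitive implementation: it assumes a dense symmetric matrix stored as a list of lists, uses 0-based indices, and the name givens_tridiagonalize is chosen here only for illustration. It applies the triplets of (10.3-33) in order, with the sine and cosine taken from (10.3-28) and (10.3-29).

```python
import math


def givens_tridiagonalize(a):
    """Return a tridiagonal matrix orthogonally similar to the symmetric a.

    Follows the ordering (p, q, r) = (j, i, j-1) of (10.3-33) with 0-based
    indices; sin and cos come from (10.3-28) and (10.3-29).
    """
    n = len(a)
    t = [list(map(float, row)) for row in a]    # work on a copy
    for r in range(n - 2):                      # row (column) being cleared
        p = r + 1
        for q in range(p + 1, n):
            norm = math.hypot(t[r][p], t[r][q])
            if norm == 0.0:
                continue                        # element is already zero
            s = -t[r][q] / norm                 # sin(theta_k)
            c = t[r][p] / norm                  # cos(theta_k)
            for k in range(n):                  # T <- T S: columns p and q
                tkp, tkq = t[k][p], t[k][q]
                t[k][p] = c * tkp - s * tkq
                t[k][q] = s * tkp + c * tkq
            for k in range(n):                  # T <- S^T T: rows p and q
                tpk, tqk = t[p][k], t[q][k]
                t[p][k] = c * tpk - s * tqk
                t[q][k] = s * tpk + c * tqk
    return t
```

For the symmetric matrix with rows (1, 1, 1/2), (1, 1, 1/4), (1/2, 1/4, 2) a single rotation is needed, and the routine returns a tridiagonal matrix with diagonal 1, 1.4, 1.6 up to roundoff.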

In Sec. 10.5-1 we shall consider a method particularly suited to finding 
eigenvalues of tridiagonal matrices, symmetric or otherwise, having real 
eigenvalues. Here, however, we shall show that B has a form which permits 
another effective way of computing its eigenvalues. We consider the matrix 


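$$\lambda I - B = \begin{pmatrix}
-b_1 + \lambda & -c_1 & & \\
-c_1 & -b_2 + \lambda & \ddots & \\
& \ddots & \ddots & -c_{n-1} \\
& & -c_{n-1} & -b_n + \lambda
\end{pmatrix} \tag{10.3-35}$$
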
If we denote the principal minor of order i in the above matrix by
$f_{n-i}(\lambda)$, we can easily show that {27}

$$f_{n-(i+1)}(\lambda) = (\lambda - b_{i+1}) f_{n-i}(\lambda) - c_i^2 f_{n-(i-1)}(\lambda) \qquad i = 1, \ldots, n-1 \tag{10.3-36}$$

with $f_n(\lambda) = 1$ and $f_{n-1}(\lambda) = -b_1 + \lambda$. The characteristic equation is

$$f_0(\lambda) = 0 \tag{10.3-37}$$


We can, of course, solve (10.3-37) by the methods of Chap. 8 [since the
matrix is symmetric, all roots of each $f_{n-i}(\lambda)$ are real]. But here we shall show
how our results about Sturm sequences in Sec. 8.9-1 can be applied to this
problem.

We assume that no $c_i$ in (10.3-34) is zero. For if any $c_i = 0$, the
determinant of the matrix in (10.3-35) can be written as the product of two
determinants of smaller tridiagonal matrices {27} and the results below will
apply to each. Under this assumption we can show that the sequence
$f_0(\lambda), \ldots, f_n(\lambda)$ forms a Sturm sequence as defined in Definition 8.1; we
leave the proof of this to a problem {27}. This suggests trying to get some
idea about the roots of (10.3-37) by applying Theorem 8.3, i.e., by
calculating V(a) and V(b) for various a and b and thereby determining the
number of roots in various intervals [a, b]. However, Theorem 8.3 requires
the Sturm sequence in question to have as its second member the derivative
of the first, but $f_1(\lambda)$ is not $f_0'(\lambda)$. However, if $f_1(\lambda)$ has the same sign as
$f_0'(\lambda)$ at each zero of $f_0(\lambda)$, then Theorem 8.3 is also valid if the sequence
used to compute V(a) - V(b) is $\{f_i(\lambda)\}$ (why?). Thus we wish to prove the
following theorem.


Theorem 10.13 At a zero of $f_0(\lambda)$, $f_1(\lambda)$ has the same sign as $f_0'(\lambda)$.
To prove this theorem we first prove a lemma.


Lemma 10.1 The i zeros $s_j$, j = 1, ..., i, of $f_{n-i}(\lambda)$ separate the i - 1
zeros $r_j$, j = 1, ..., i - 1, of $f_{n-(i-1)}(\lambda)$ for i = 2, ..., n. That is,


$$-\infty < s_1 < r_1 < s_2 < r_2 < \cdots < r_{i-1} < s_i < \infty \tag{10.3-38}$$


PROOF OF LEMMA 10.1 By induction. For i = 2 this can be proved
directly using (10.3-36) {27}. Suppose (10.3-38) holds for some i. Then,
using property 2 of a Sturm sequence at each $s_j$, the function $f_{n-(i+1)}(\lambda)$
has sign opposite to $f_{n-(i-1)}(\lambda)$. But (10.3-38) implies that the sign of
$f_{n-(i-1)}(\lambda)$ alternates from one zero of $f_{n-i}(\lambda)$ to the next. Thus between
$s_1$ and $s_i$, $f_{n-(i+1)}(\lambda)$ has i - 1 zeros lying between the zeros of $f_{n-i}(\lambda)$. At
$s_1$, $f_{n-(i+1)}(\lambda)$ has sign opposite to $f_{n-(i-1)}(\lambda)$, but as $\lambda \to -\infty$, both
$f_{n-(i-1)}(\lambda)$ and $f_{n-(i+1)}(\lambda)$ have the same sign since they differ by degree
2. Therefore, $f_{n-(i+1)}(\lambda)$ has a zero between $-\infty$ and $s_1$, and by a similar
argument it has a zero between $s_i$ and $+\infty$, which proves the lemma.
PROOF OF THEOREM 10.13 Since the zeros of $f_0(\lambda)$ and $f_1(\lambda)$ separate
each other, the zeros of $f_0(\lambda)$ are distinct. [Note that even though B
may have multiple eigenvalues, $f_0(\lambda)$ has only simple zeros. This is so
since some $c_i$ must be 0 for there to be multiple eigenvalues, and we have
assumed above that we have decomposed B into submatrices in each
of which no $c_i$ is 0.] Therefore, between each two zeros of $f_0(\lambda)$, $f_0'(\lambda)$
also has a zero. Moreover, for sufficiently large $\lambda$, $f_0'(\lambda)$ and $f_1(\lambda)$ have
the same sign since both are polynomials of degree n - 1 with positive
leading coefficient. Therefore, they both have the same sign at the
largest zero of $f_0(\lambda)$. The argument above then guarantees that they
have the same sign at every zero of $f_0(\lambda)$, which proves the theorem.


To compute the value of V(x) requires the evaluation of each $f_{n-i}(\lambda)$ at
$\lambda = x$, but this is easily accomplished using (10.3-36). However, there is a
danger that when there are clusters of eigenvalues, the sequence $f_{n-i}(\lambda)$ may
have some very small numbers for values of $\lambda$ close to such eigenvalues.
Now, if underflow occurs in the computation of these numbers and they are
replaced by 0, the sign determination may be incorrect. To avoid such a
situation, we work with the quotients

$$q_{n-i}(\lambda) = \frac{f_{n-i-1}(\lambda)}{f_{n-i}(\lambda)} \tag{10.3-39}$$

which satisfy the recurrence formulas

$$q_n(\lambda) = -b_1 + \lambda \qquad q_{n-i}(\lambda) = (\lambda - b_{i+1}) - \frac{c_i^2}{q_{n-i+1}(\lambda)} \qquad i = 1, \ldots, n-1 \tag{10.3-40}$$

Then $V(\lambda)$ equals the number of negative $q_{n-i}(\lambda)$. Now (10.3-40) in itself does
not give a process free of problems. However, by prescaling our matrix and
modifying the recurrence appropriately, we get an algorithm which will
almost always work.

Let m and M be respectively the smallest and largest positive floating-point
numbers which can be represented in the computer and define

$$t = m^{1/4} M^{-1/2}$$

We now find $\sigma$, which is the largest power of the base of our number system
such that $\sigma|b_i|, \sigma|c_i| \le tM$, i = 1, ..., n. We work with the matrix with
elements $\sigma b_i$, $\sigma c_i$, which we rename $b_i$, $c_i$. Note that the multiplication by
$\sigma$ does not introduce any roundoff error. The eigenvalues of this new matrix
will be $\sigma$ times the original eigenvalues, so that we must remember to
“unscale” at the end. As a result of this prescaling we have that
$\|B\|_\infty \le 3tM$, from which it follows that $\rho(B) \le \|B\|_\infty \le 3tM$ so that
$|\lambda| \le 3tM$ for any $\lambda$ for which we compute $V(\lambda)$. We are now ready to start
the recurrence (10.3-40), modified as follows. If, for any i, i = 1, ..., n,
$|q_{n-i+1}(\lambda)| < \sqrt{m}$, set $q_{n-i+1}(\lambda) = -\sqrt{m}$. This modification can be
interpreted to mean that we have perturbed our matrix slightly by working
with $b_i + \epsilon_i$ with $|\epsilon_i| < 2\sqrt{m}$. Note that $|\epsilon_i|$ is usually much smaller
than roundoff error. With this modification, we cannot have overflow since
$c_i^2/\sqrt{m} \le t^2 M^2/\sqrt{m} = M$ and the addition of $\lambda - b_{i+1}$ leaves the result
less than or equal to M (why?) and clearly underflow does not cause any
problems either. Finally, a vanishing $c_i^2$ does not cause any trouble since
we can look at $c_i^2/q_{n-i+1} = 0$ as resulting from an underflow with $c_i \ne 0$,
so that we need not worry about decomposing our matrix into submatrices but
can work all the time with the scaled original matrix.

Therefore, after the reduction of the matrix to tridiagonal form, the 
calculation of the eigenvalues of the tridiagonal matrix can be accomplished 
without too much difficulty using the properties of Sturm sequences in 
conjunction with one of the methods of Chap. 8. In particular, the method of 
bisection (see Sec. 2.2), although somewhat slow, is a very convenient method 
for calculating the eigenvalues. We proceed as follows. Evaluate $V(\lambda)$
for the set of functions (10.3-40) for a sequence of values of $\lambda$. Whenever
two values, $\lambda_1$ and $\lambda_2$, are found such that $V(\lambda_1) \ne V(\lambda_2)$, then there is an
eigenvalue between $\lambda_1$ and $\lambda_2$. By evaluating $V[(\lambda_1 + \lambda_2)/2]$ the interval
in which $\lambda$ lies can be halved, and by continuing in this way the interval
can be made as small as desired.
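
The counting and bisection steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' program: the names count_above and kth_largest_eigenvalue are hypothetical, sys.float_info.min stands in for m, the prescaling by $\sigma$ is omitted (so the sketch is meant for matrices of moderate size), and the starting interval comes from a simple row-sum bound. Since every $q$ is positive for large $\lambda$, the number of negative quotients counts the eigenvalues greater than $\lambda$.

```python
import math
import sys

SQRT_M = math.sqrt(sys.float_info.min)    # plays the role of sqrt(m)


def count_above(b, c, lam):
    """V(lam): the number of negative q's in (10.3-40).

    b holds the n diagonal and c the n-1 off-diagonal entries of B in
    (10.3-34); the result is the number of eigenvalues greater than lam.
    """
    count = 0
    q = lam - b[0]                        # q_n = -b_1 + lam
    for i in range(len(b)):
        if i > 0:
            q = (lam - b[i]) - c[i - 1] ** 2 / q
        if abs(q) < SQRT_M:               # modified recurrence of the text
            q = -SQRT_M
        if q < 0.0:
            count += 1
    return count


def kth_largest_eigenvalue(b, c, k, tol=1e-12):
    """Bisection for the k-th largest eigenvalue (k = 1 is the largest)."""
    n = len(b)
    # Row-sum bound: every eigenvalue lies in [-bound, bound].
    bound = max(abs(b[i])
                + (abs(c[i - 1]) if i > 0 else 0.0)
                + (abs(c[i]) if i < n - 1 else 0.0) for i in range(n))
    lo, hi = -bound, bound
    while hi - lo > tol * max(1.0, bound):
        mid = 0.5 * (lo + hi)
        if count_above(b, c, mid) >= k:   # at least k eigenvalues exceed mid
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Bisection gains one binary digit per evaluation of $V$, so about 40 evaluations are used per eigenvalue with the default tolerance.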

Since the tridiagonalization is a finite iterative process, the error in 
generating B can be strictly controlled, and then the eigenvalues of B can be 
determined with arbitrary accuracy. Thus Givens’ method is generally
preferable to the Jacobi method. Only for matrices where the diagonal terms
dominate would we expect the Jacobi method to be competitive with
Givens’ method (why?).

The eigenvectors of A bear the same relation to those of B as in the 
Jacobi method. That is, by keeping track of the successive orthogonal trans- 
formations as in (10.3-24) to (10.3-26) the eigenvectors of A can be calculated 
from those of B {28}. Note that the eigenvectors of B can be computed using 
the method of inverse iteration given in Sec. 10.2-2, which is very efficient for 
tridiagonal matrices. 


Example 10.5 Apply Givens’ method to the matrix A of Example 10.1.
The only off-tridiagonal element is $a_{13}$, so that there is only one transformation to
perform. From (10.3-28) and (10.3-29) with p = 2, q = 3, r = 1

$$\alpha = \frac{2}{\sqrt{5}} \qquad \sin\theta = -\frac{1}{\sqrt{5}} \qquad \cos\theta = \frac{2}{\sqrt{5}}$$

Using these to get the matrix S of (10.3-9), we then calculate

$$B = S^T A S = \begin{pmatrix} 1 & \sqrt{5}/2 & 0 \\ \sqrt{5}/2 & 7/5 & 11/20 \\ 0 & 11/20 & 8/5 \end{pmatrix}$$

Then from (10.3-36) we get

$$f_3(\lambda) = 1 \qquad f_2(\lambda) = \lambda - 1$$

$$f_1(\lambda) = \lambda^2 - 2.4\lambda + .15 \qquad f_0(\lambda) = \lambda^3 - 4\lambda^2 + 3.6875\lambda + .0625$$

with $f_0(\lambda) = 0$ the characteristic equation of A.


10.3-3 Householder’s Method 


This method, a variation of Givens’ method, enables us to reduce A to tridi- 
agonal form with about half as much computation as Givens’ method re- 
quires. The technique is to reduce a whole row and column (except for the 
tridiagonal elements) to zero at a time. 

Let v be a vector such that

$$v^T v = 1 \tag{10.3-41}$$

Then it is easy to show {29} that the matrix

$$P = I - 2vv^T \tag{10.3-42}$$

is orthogonal and symmetric. In particular we choose $v_k$ to be a vector whose
first k - 1 components are zero, so that

$$v_k^T = [0 \quad 0 \quad \cdots \quad 0 \quad v_k^{(k)} \quad \cdots \quad v_k^{(n)}] \tag{10.3-43}$$

Then with

$$P_k = I - 2v_k v_k^T \tag{10.3-44}$$

we define

$$A_k = P_k^T A_{k-1} P_k \qquad k = 2, \ldots, n-1 \qquad A_1 = A \tag{10.3-45}$$


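Now suppose that the symmetric matrix $A_{k-1} = [g_{ij}]$ has zeros in its first
k - 2 rows and columns except for the tridiagonal elements:

$$A_{k-1} = \begin{pmatrix}
g_{11} & g_{12} & & & & \\
g_{12} & g_{22} & \ddots & & & \\
& \ddots & g_{k-2,k-2} & g_{k-2,k-1} & 0 \ \cdots & 0 \\
& & g_{k-2,k-1} & g_{k-1,k-1} & \cdots & g_{k-1,n} \\
& & 0 & \vdots & & \vdots \\
& & 0 & g_{k-1,n} & \cdots & g_{nn}
\end{pmatrix} \tag{10.3-46}$$

where the row beginning $g_{k-2,k-1}$, $g_{k-1,k-1}$ is row k - 1.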
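The matrix $P_k$ has the form

$$P_k = \begin{pmatrix}
1 & & & & & \\
& \ddots & & & & \\
& & 1 & & & \\
& & & 1 - 2(v_k^{(k)})^2 & \cdots & -2v_k^{(k)} v_k^{(n)} \\
& & & \vdots & & \vdots \\
& & & -2v_k^{(k)} v_k^{(n)} & \cdots & 1 - 2(v_k^{(n)})^2
\end{pmatrix} \tag{10.3-47}$$

where the first k - 1 rows and columns are those of the identity matrix.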


Using (10.3-45) to (10.3-47) we can verify that $A_k$ has zeros in the positions
shown as zero for $A_{k-1}$ in (10.3-46). Our object is to choose the n - k + 1
numbers $v_k^{(k)}, \ldots, v_k^{(n)}$ to satisfy (10.3-41) and so that the n - k off-tridiagonal
elements in row (column) k - 1 of $A_k$ are zero.


We define

$$S = \sum_{j=k}^{n} g_{k-1,j}^2 \tag{10.3-48}$$

and then let

$$(v_k^{(k)})^2 = \frac{1}{2}\left[1 \pm \frac{g_{k-1,k}}{\sqrt{S}}\right] \tag{10.3-49}$$

and

$$v_k^{(j)} = \pm \frac{g_{k-1,j}}{2v_k^{(k)}\sqrt{S}} \qquad j = k+1, \ldots, n \tag{10.3-50}$$

where the plus or minus sign will be chosen below. The motivation for
(10.3-49) and (10.3-50) can be found in the algebra leading to the proof that
the desired n - k elements in the (k - 1)st row (column) of $A_k$ are zero and
that (10.3-41) is satisfied. We leave this algebra to a problem {29}.
Proceeding as above at each step, we arrive at a tridiagonal matrix $A_{n-1}$.

The accuracy of this method depends naturally on the accuracy of the
matrices $P_k$, and these in turn depend upon the accuracy of the components
of (10.3-43). The key to making this accuracy as great as possible is to make
$v_k^{(k)}$ as accurate as possible, which we do by choosing the sign in (10.3-49) to
be that of $g_{k-1,k}$, thus avoiding a possible loss of significant figures resulting
from the subtraction of almost equal quantities. We then use the same sign
in (10.3-50). We leave to a problem {31} the result that the total number of
operations is of the order of $\tfrac{2}{3}n^3$, compared with $\tfrac{4}{3}n^3$ for Givens’ method. At
each stage it would appear that two square roots are required, one for $\sqrt{S}$
and one for $[(v_k^{(k)})^2]^{1/2}$. However, by arranging the calculations properly, the
latter of these two need not be calculated {31}. Therefore, Householder’s
method requires n - 2 square roots compared with (n - 2)(n - 1)/2 for
Givens’ method. For large matrices, then, Householder’s method is a more
efficient way than Givens’ method to reduce a symmetric matrix to
tridiagonal form. The discussion in Sec. 10.3-2 on finding the eigenvalues and
eigenvectors of tridiagonal matrices also applies here. Calculation of the
eigenvectors of A from the eigenvectors of a tridiagonal matrix found using
Householder’s method is considered in a problem {30}.


Example 10.6 Apply Householder’s method to the matrix of Example 10.1. As in Example
10.5, there is only one step to perform. We have

$$S = 1^2 + (\tfrac{1}{2})^2 = \tfrac{5}{4} \qquad \sqrt{S} = 1.11803$$

Since $a_{12} = 1$, we choose the + sign in (10.3-49) and get

$$v_2^{(2)} = \left[\frac{1}{2}\left(1 + \frac{1}{1.11803}\right)\right]^{1/2} = .97325 \qquad v_2^{(3)} = \frac{1}{4 \times (1.11803) \times .97325} = .22975$$

(In fact the calculation of the square root required to get $v_2^{(2)}$ can be avoided, as mentioned
above {31}.) The best way to proceed with the computation is to note that

$$A_{k-1}P_k = A_{k-1} - 2w_k v_k^T \qquad \text{with } w_k = A_{k-1}v_k$$

and then to use the result {30}

$$A_k = P_k^T A_{k-1} P_k = A_{k-1} - 2v_k q_k^T - 2q_k v_k^T \qquad \text{with } q_k = w_k - (v_k^T w_k)v_k$$

In this example then we compute

$$w_2^T = v_2^T A = [1.08813 \quad 1.03069 \quad .70281]$$

Then $v_2^T w_2 = 1.16459$ and $q_2^T = [1.08813 \quad -.10275 \quad .43525]$

Finally

$$A_2 = \begin{pmatrix} 1 & -1.11803 & 0 \\ -1.11803 & 1.40000 & -.55000 \\ 0 & -.55000 & 1.60000 \end{pmatrix}$$


which except for roundoff and some sign changes is the same matrix as B in Example 10.5. 
Because only a single orthogonal transformation is needed in this case, we expect Givens’ 
and Householder’s methods to lead to essentially the same tridiagonal matrix. But for 
higher-order matrices, this will not be the case. Note the desirability of computing double- 
precision scalar products in this method in order to minimize roundoff. 
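
The update formulas used in this example translate directly into a short routine. The following is only a sketch under stated assumptions: the name householder_tridiagonalize is hypothetical, indices are 0-based, ordinary double precision is used throughout (no extended-precision scalar products), and the square root for $v_k^{(k)}$ is taken explicitly rather than avoided. The sign in (10.3-49) and (10.3-50) is that of $g_{k-1,k}$.

```python
import math


def householder_tridiagonalize(a):
    """Reduce the symmetric matrix a to tridiagonal form, one row per step."""
    n = len(a)
    g = [list(map(float, row)) for row in a]     # work on a copy
    for k in range(1, n - 1):                    # clears row/column k - 1
        s = sum(g[k - 1][j] ** 2 for j in range(k, n))           # (10.3-48)
        root_s = math.sqrt(s)
        if root_s == 0.0:
            continue                             # row is already clear
        sign = 1.0 if g[k - 1][k] >= 0.0 else -1.0
        v = [0.0] * n
        v[k] = math.sqrt(0.5 * (1.0 + sign * g[k - 1][k] / root_s))   # (10.3-49)
        for j in range(k + 1, n):
            v[j] = sign * g[k - 1][j] / (2.0 * v[k] * root_s)         # (10.3-50)
        w = [sum(g[i][j] * v[j] for j in range(n)) for i in range(n)]  # w = A v
        vw = sum(v[i] * w[i] for i in range(n))
        q = [w[i] - vw * v[i] for i in range(n)]                  # q = w - (v^T w) v
        for i in range(n):
            for j in range(n):
                g[i][j] -= 2.0 * (v[i] * q[j] + q[i] * v[j])      # A - 2vq^T - 2qv^T
    return g
```

Taking the matrix of Example 10.1 to have rows (1, 1, 1/2), (1, 1, 1/4), (1/2, 1/4, 2), an assumption that is consistent with the values of S, w, and q quoted above, the routine reproduces the matrix $A_2$ of this example to roundoff.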


10.4 METHODS FOR NONSYMMETRIC MATRICES 


Our approach to nonsymmetric matrices will be similar to that of Givens’ 
and Householder’s methods in the sense that we shall perform a series of 
transformations on the matrix A—similarity more often than orthogonal— 
in order to reduce A to a matrix B with the same eigenvalues as A but whose 
eigenvalues are more easily calculable. In the method to be considered in 
Sec. 10.4-1 the matrix B will be tridiagonal, as in the case of Givens’ and 
Householder’s methods. In Sec. 10.4-2 we shall consider techniques for re- 
ducing A to a matrix B which has the form shown in Fig. 10.2. Such a matrix 
is said to be in supertriangular or, more commonly, in lower Hessenberg 
form (whereas the transpose of such a matrix is said to be in upper Hessen- 
berg form). 
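$$\begin{pmatrix}
b_{11} & b_{12} & & & \\
b_{21} & b_{22} & b_{23} & & \\
\vdots & & \ddots & \ddots & \\
\vdots & & & \ddots & b_{n-1,n} \\
b_{n1} & \cdots & \cdots & \cdots & b_{nn}
\end{pmatrix}$$

Figure 10.2 A matrix in lower Hessenberg form.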


A vital aspect of the calculation of the eigenvalues of nonsymmetric 
matrices is the stability of the calculation with respect to the growth of 
roundoff errors. The details of the roundoff-error analysis of the methods we 
are about to consider are beyond the scope of this book. We shall content 
ourselves with making some general comments on this matter in what fol- 
lows; for more details the reader is referred to the references mentioned in 
the Bibliographic Notes. 


10.4-1 Lanczos’ Method 


Our object here is to construct a matrix S such that the result of the
similarity transformation

$$T = S^{-1}AS \tag{10.4-1}$$

is a tridiagonal matrix T. Our approach will be to construct a sequence of
vectors $x_1, \ldots, x_n$ which, as the columns of S, achieve the desired result. In
the construction of this set of vectors, we shall also construct another set,
$y_1, \ldots, y_n$, such that the sequences $\{x_i\}$ and $\{y_i\}$ are biorthogonal; that is,

$$y_j^T x_i = 0 \qquad \text{if } i \ne j \tag{10.4-2}$$

The two sequences of vectors are generated by the following recursion

$$\begin{aligned}
x_{k+1} &= Ax_k - b_k x_k - c_{k-1}x_{k-1} \\
y_{k+1} &= A^T y_k - b_k y_k - c_{k-1}y_{k-1}
\end{aligned} \qquad k = 1, \ldots, n-1 \tag{10.4-3}$$

with $x_0 = y_0 = 0$, $x_1$ and $y_1$ arbitrary vectors and with the $b_k$’s and $c_k$’s
defined by

$$b_k = \frac{y_k^T A x_k}{y_k^T x_k} \qquad c_{k-1} = \frac{y_{k-1}^T A x_k}{y_{k-1}^T x_{k-1}} = \frac{y_k^T x_k}{y_{k-1}^T x_{k-1}} \qquad c_0 = 0 \qquad k = 1, \ldots, n-1 \tag{10.4-4}$$

Now clearly this recursion requires that $y_j^T x_j \ne 0$, j = 1, ..., n - 1. For the
present let us assume this is true. We can then prove the following theorem.

Theorem 10.14 For the system of vectors defined by (10.4-3) and
(10.4-4), Eq. (10.4-2) holds for i, j = 1, ..., n.


PROOF By induction. We have

$$x_2 = Ax_1 - b_1 x_1 = Ax_1 - \frac{y_1^T A x_1}{y_1^T x_1} x_1$$

from which we have immediately $y_1^T x_2 = 0$. Similarly we get $x_1^T y_2 = 0$.
Now suppose (10.4-2) holds for all i and j less than or equal to k. We
have from (10.4-3) and (10.4-4)

$$y_j^T x_{k+1} = y_j^T A x_k - \frac{y_k^T A x_k}{y_k^T x_k} y_j^T x_k - \frac{y_{k-1}^T A x_k}{y_{k-1}^T x_{k-1}} y_j^T x_{k-1} \tag{10.4-5}$$

For j = k (k - 1) the first term on the right-hand side cancels the second
(third) and the other term is zero by the induction hypothesis. For
j < k - 1 the last two terms are zero by the induction hypothesis. The
remaining term is $y_j^T A x_k = x_k^T A^T y_j$. For j < k - 1 this term can be
shown to be 0 by multiplying the second equation of (10.4-3) with k = j
by $x_k^T$ and using the induction hypothesis. Since we can similarly show
that $x_j^T y_{k+1} = 0$ for j = 1, ..., k, the theorem is proved.
The second form of $c_{k-1}$ in (10.4-4) follows using the second equation
of (10.4-3) and the biorthogonality. From this theorem we obtain a
corollary.


Corollary 10.3 Under the assumption that $y_j^T x_j \ne 0$, j = 1, ..., n,


1. The vectors $x_i$, i = 1, ..., n, are linearly independent.
2. If (10.4-3) is used with k = n to generate $x_{n+1}$, then $x_{n+1} = 0$.


PROOF If, for some $j \le n$, $x_j$ were a linear combination of the $x_k$,
k < j, we would have

$$x_j = \sum_{k=1}^{j-1} \gamma_k x_k \tag{10.4-6}$$

But this would mean that $y_j^T x_j = 0$ because of the biorthogonality. Since
this contradicts the hypothesis, we have proved the first part of the
corollary. To prove the second part we note first that the proof of
Theorem 10.14 implies that if (10.4-3) is used with k = n and $y_n^T x_n \ne 0$,
then Eq. (10.4-2) still holds if the vectors $x_{n+1}$ and $y_{n+1}$ are included in
the sequences. But then the first part of the corollary would also hold
for the vectors $x_1, \ldots, x_{n+1}$. Since n + 1 vectors cannot be linearly
independent in n space, the only way out of this impasse is to have
$x_{n+1} = 0$.
Using this corollary we rewrite the first equation of (10.4-3) as

$$\begin{aligned}
Ax_1 &= x_2 + b_1 x_1 \\
Ax_k &= x_{k+1} + b_k x_k + c_{k-1}x_{k-1} \qquad k = 2, \ldots, n-1 \\
Ax_n &= b_n x_n + c_{n-1}x_{n-1}
\end{aligned} \tag{10.4-7}$$

Thus, if S is the nonsingular matrix with columns $x_1, \ldots, x_n$, it follows from
(10.4-7) that

$$AS = S \begin{pmatrix}
b_1 & c_1 & & & \\
1 & b_2 & c_2 & & \\
& 1 & \ddots & \ddots & \\
& & \ddots & \ddots & c_{n-1} \\
& & & 1 & b_n
\end{pmatrix} = ST \tag{10.4-8}$$

Therefore, the tridiagonal matrix T of (10.4-8) is the desired matrix of
(10.4-1). To find a tridiagonal matrix with the same eigenvalues as a general
matrix A, we need then only calculate the $b_k$’s and $c_k$’s given by (10.4-4)
starting from arbitrary $x_1$ and $y_1$.
Since the method of Lanczos, which is also called the method of minimized
iterations {37}, is not used in practice because of its numerical
instability, we shall not pursue this method further.


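Example 10.7 Use Lanczos’ method to convert to tridiagonal form

$$A = \begin{pmatrix} 2 & -2 & 3 \\ 1 & 1 & 1 \\ 1 & 3 & -1 \end{pmatrix}$$

With $x_1^T = y_1^T = (0, 0, 1)$ we calculate

$$\begin{aligned}
(Ax_1)^T &= [3 \quad 1 \quad -1] & (A^T y_1)^T &= [1 \quad 3 \quad -1] \\
b_1 &= -\tfrac{1}{1} = -1 \\
x_2^T &= [3 \quad 1 \quad 0] & y_2^T &= [1 \quad 3 \quad 0] \\
(Ax_2)^T &= [4 \quad 4 \quad 6] & (A^T y_2)^T &= [5 \quad 1 \quad 6] \\
b_2 &= \tfrac{8}{3} & c_1 &= 6 \\
x_3^T &= [-4 \quad \tfrac{4}{3} \quad 0] & y_3^T &= [\tfrac{7}{3} \quad -7 \quad 0] \\
(Ax_3)^T &= [-\tfrac{32}{3} \quad -\tfrac{8}{3} \quad 0] & (A^T y_3)^T &= [-\tfrac{7}{3} \quad -\tfrac{35}{3} \quad 0] \\
b_3 &= \tfrac{1}{3} & c_2 &= -\tfrac{28}{9}
\end{aligned}$$

so that

$$T = \begin{pmatrix} -1 & 6 & 0 \\ 1 & \tfrac{8}{3} & -\tfrac{28}{9} \\ 0 & 1 & \tfrac{1}{3} \end{pmatrix}$$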
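
The arithmetic of this example can be checked mechanically. The sketch below is offered as an illustration only: the name lanczos_tridiagonalize is hypothetical, exact rational arithmetic from Python's fractions module is assumed so that the numerical instability mentioned above does not enter, and $b_k$ is computed for k = 1, ..., n as the example itself does. It runs the recursion (10.4-3) with the coefficients (10.4-4), using the second form of $c_{k-1}$.

```python
from fractions import Fraction


def lanczos_tridiagonalize(a, x1, y1):
    """Return (T, x_next) for the recursion (10.4-3), (10.4-4).

    T is the tridiagonal matrix of (10.4-8); x_next is x_{n+1}, which
    Corollary 10.3 predicts to be the zero vector.
    """
    n = len(a)
    a = [[Fraction(v) for v in row] for row in a]
    at = [list(col) for col in zip(*a)]          # A transposed

    def matvec(m, v):
        return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    zero = [Fraction(0)] * n
    x_old, y_old = zero, zero                    # x_0 = y_0 = 0
    x = [Fraction(v) for v in x1]
    y = [Fraction(v) for v in y1]
    b, c = [], []
    c_prev, d_old = Fraction(0), None            # c_0 = 0
    for k in range(n):
        d = dot(y, x)                            # y_k^T x_k
        if d == 0:
            raise ZeroDivisionError("breakdown: y_k^T x_k = 0")
        if k > 0:
            c_prev = d / d_old                   # second form of c_{k-1}
            c.append(c_prev)
        ax, aty = matvec(a, x), matvec(at, y)
        bk = dot(y, ax) / d
        b.append(bk)
        x_new = [p - bk * q - c_prev * r for p, q, r in zip(ax, x, x_old)]
        y_new = [p - bk * q - c_prev * r for p, q, r in zip(aty, y, y_old)]
        x_old, y_old, x, y, d_old = x, y, x_new, y_new, d
    t = [[Fraction(0)] * n for _ in range(n)]
    for k in range(n):
        t[k][k] = b[k]
        if k + 1 < n:
            t[k][k + 1] = c[k]                   # superdiagonal entries c_k
            t[k + 1][k] = Fraction(1)            # subdiagonal entries 1
    return t, x
```

Called with the matrix A and starting vectors of Example 10.7, the routine returns the matrix T found there, together with a zero vector for $x_4$.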


10.4-2 Supertriangularization 


If Givens’ or Householder’s method is applied to a nonsymmetric matrix 
A, the result is a Hessenberg matrix, as shown in Fig. 10.2 (why?). However, 
it is possible to reduce a general matrix A to Hessenberg form with
considerably less computation than is required by the methods mentioned
above. This can be achieved by using Gaussian elimination instead of ortho- 
gonal transformations. Moreover, by using partial pivoting, this process 
can be made extremely stable with respect to roundoff error. 

As in Givens’ and Householder’s methods, we proceed row by row to
make the off-tridiagonal elements (in the upper triangle) zero. Suppose we
have done this for rows 1, ..., k - 1. For convenience let the elements in the
matrix at this stage still be denoted by $a_{ij}$. Then for row k we


1. Select the largest element $a_{kl}$ in magnitude among $a_{k,k+1}, \ldots, a_{kn}$ and
interchange columns k + 1 and l.
2. Calculate

$$m_j = -\frac{a_{kj}}{a_{k,k+1}} \qquad j = k+2, \ldots, n \tag{10.4-9}$$

where, because of step 1, $|m_j| \le 1$.
3. Add $m_j$ times column k + 1 to column j, j = k + 2, ..., n.


Step 1 and each part of step 3 are equivalent to postmultiplying the matrix 
by elementary column matrices {38}. Therefore, to complete the similarity 
transformation for row k, it is only necessary to premultiply by the inverses 
of these (nonsingular) elementary matrices {38}. It is quite easy to see that 
the zero elements in rows 1, ..., k — 1 remain zero. Therefore, performing 
this algorithm for k = 1, ..., n - 2 results in a matrix $B = [b_{ij}]$ in Hessenberg
form. The stability of this method with respect to roundoff results from the
fact that the $m_j$ in (10.4-9) are all no greater than 1 in magnitude.

The number of multiplications and divisions required is $\tfrac{5}{6}n^3 + O(n^2)$
while Householder’s method for nonsymmetric matrices requires
$\tfrac{5}{3}n^3 + O(n^2)$ operations plus n - 2 square roots {38}.
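
The three steps and the accompanying inverse operations can be sketched as follows. This is an illustration under stated assumptions, not a production routine: the name eliminate_to_hessenberg is hypothetical, indices are 0-based, and exact rational arithmetic is used so that the result can be compared digit for digit with hand computation. Each column operation is followed immediately by its inverse row operation; because none of these row operations touches row k, the multipliers are the same as when all the inverses are applied at the end.

```python
from fractions import Fraction


def eliminate_to_hessenberg(a):
    """Reduce a to lower Hessenberg form by elimination with interchanges.

    Every stage is a similarity transformation, so the eigenvalues of the
    result are those of a.
    """
    n = len(a)
    m = [[Fraction(v) for v in row] for row in a]
    for k in range(n - 2):
        # Step 1: bring the largest of a[k][k+1..n-1] into column k + 1.
        l = max(range(k + 1, n), key=lambda j: abs(m[k][j]))
        if m[k][l] == 0:
            continue                              # nothing to eliminate
        if l != k + 1:
            for row in m:                         # interchange columns
                row[k + 1], row[l] = row[l], row[k + 1]
            m[k + 1], m[l] = m[l], m[k + 1]       # interchange rows
        for j in range(k + 2, n):
            mult = -m[k][j] / m[k][k + 1]         # multiplier of (10.4-9)
            for row in m:                         # step 3 applied to column j
                row[j] += mult * row[k + 1]
            # Inverse elementary matrix: row k+1 minus mult times row j.
            m[k + 1] = [u - mult * v for u, v in zip(m[k + 1], m[j])]
    return m
```

For the matrix with rows (2, -2, 3), (1, 1, 1), (1, 3, -1) the routine interchanges columns (and rows) 2 and 3, uses the single multiplier 2/3, and returns the rows (2, 3, 0), (1/3, -5/3, 11/9), (1, 1, 5/3).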

In Sec. 10.5-2 we shall consider a powerful method for calculating all the 
eigenvalues of a matrix in Hessenberg form. In the remainder of this section, 
however, we shall consider how we can calculate an eigenvalue and eigen- 
vector of such a matrix. We assume throughout the remainder of this section 
that no $b_{i,i+1}$ is 0, for if so we can then consider a reduced matrix {39}.

The system of equations $(B - \lambda I)x = 0$ can be written

$$\begin{aligned}
(b_{11} - \lambda)x_1 + b_{12}x_2 &= 0 \\
b_{21}x_1 + (b_{22} - \lambda)x_2 + b_{23}x_3 &= 0 \\
&\;\;\vdots \\
b_{n1}x_1 + b_{n2}x_2 + \cdots + (b_{nn} - \lambda)x_n &= 0
\end{aligned} \tag{10.4-10}$$

Now let us assume a value of $\lambda$ and set $x_1 = 1$ in the first equation. The first
n - 1 equations can then be solved recursively for $x_2, \ldots, x_n$. But the last
equation will be satisfied only if $\lambda$ is an eigenvalue (why?). Denote the value
of the left-hand side of the last equation by $F(\lambda)$. We shall now show that
$F(\lambda)$ is a constant multiple of the characteristic polynomial.
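The matrix $B - \lambda I$ has the form

$$B - \lambda I = \begin{pmatrix}
b_{11} - \lambda & b_{12} & & & \\
b_{21} & b_{22} - \lambda & b_{23} & & \\
\vdots & & \ddots & \ddots & \\
\vdots & & & \ddots & b_{n-1,n} \\
b_{n1} & \cdots & \cdots & \cdots & b_{nn} - \lambda
\end{pmatrix} \tag{10.4-11}$$

For i = 2, ..., n we multiply the ith column by $x_i$ as found above and add
this to the first column, thereby obtaining the matrix

$$\begin{pmatrix}
0 & b_{12} & & & \\
0 & b_{22} - \lambda & b_{23} & & \\
\vdots & & \ddots & \ddots & \\
0 & & & \ddots & b_{n-1,n} \\
F(\lambda) & b_{n2} & \cdots & \cdots & b_{nn} - \lambda
\end{pmatrix} \tag{10.4-12}$$

We have then, on expanding by the first column,

$$|B - \lambda I| = (-1)^{n-1} F(\lambda) \prod_{i=1}^{n-1} b_{i,i+1} = cF(\lambda) \tag{10.4-13}$$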


where c is nonzero since we have assumed no $b_{i,i+1} = 0$. Since our object is
to find a zero of F(A), we can use one of the techniques of Chap. 8 to find this 
zero. Since the eigenvalues of B may be complex, an appropriate choice is 
Muller’s method (Chap. 8, Prob. 21), which can find a complex root starting 
with real approximations and which has good convergence properties. Alter- 
natively, we could compute one or two derivatives of F(A) and use the 
Newton-based method of Sec. 8.12 or Laguerre’s method (Sec. 8.10-4). 

To compute $F'(\lambda)$, we differentiate (10.4-10) with respect to $\lambda$, taking into
account that the $x_i$, i = 1, ..., n, are functions of $\lambda$. We get the system

$$\begin{aligned}
(b_{11} - \lambda)x_1' - x_1 + b_{12}x_2' &= 0 \\
b_{21}x_1' + (b_{22} - \lambda)x_2' - x_2 + b_{23}x_3' &= 0 \\
&\;\;\vdots \\
b_{n1}x_1' + b_{n2}x_2' + \cdots + (b_{nn} - \lambda)x_n' - x_n &= 0
\end{aligned} \tag{10.4-14}$$

Setting $x_1 = 1$, $x_1' = 0$, we solve (10.4-14) successively for $x_2', \ldots, x_n'$, using the
previously computed values $x_2, \ldots, x_n$. The value of the left-hand side of the
last equation is then $F'(\lambda)$. We proceed similarly for $F''(\lambda)$.
It might seem that the accuracy of this procedure would be severely 
curtailed if any of the $b_{i,i+1}$ are very small in magnitude because of the
necessity of dividing by $b_{i,i+1}$ in the recursion used to solve the first n - 1
equations of (10.4-10). But, in fact, it can be shown that there is little
correlation between the accuracy of the method and the magnitude of the
$b_{i,i+1}$, i = 1, ..., n - 1 [see Wilkinson (1959a)].

Having computed an eigenvalue of B by the procedure above, we compute
the corresponding eigenvector by inverse iteration (Sec. 10.2-2), which
is quite efficient for Hessenberg matrices.

Once we have computed one eigenvalue $\lambda_1$, we apply an implicit
deflation and find a zero of $F_1(\lambda) = F(\lambda)/(\lambda - \lambda_1)$. More generally, once we
have computed $\lambda_1, \ldots, \lambda_r$, we seek a zero of

$$F_r(\lambda) = \frac{F(\lambda)}{\prod_{i=1}^{r} (\lambda - \lambda_i)} \tag{10.4-15}$$



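evaluating $F(\lambda)$ by (10.4-10). The derivatives needed for the Newton-based
method and Laguerre’s method can be computed using the formulas {40}

$$\frac{F_r'(\lambda)}{F_r(\lambda)} = \frac{F'(\lambda)}{F(\lambda)} - \sum_{i=1}^{r} \frac{1}{\lambda - \lambda_i} \tag{10.4-16}$$

$$\left[\frac{F_r'(\lambda)}{F_r(\lambda)}\right]^2 - \frac{F_r''(\lambda)}{F_r(\lambda)} = \left[\frac{F'(\lambda)}{F(\lambda)}\right]^2 - \frac{F''(\lambda)}{F(\lambda)} - \sum_{i=1}^{r} (\lambda - \lambda_i)^{-2} \tag{10.4-17}$$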


Example 10.8 Apply the Gaussian elimination and deflation methods of this section to
the matrix of Example 10.7.

Interchanging the second and third columns of the matrix and then eliminating the
element in the (1, 3) position, we obtain the matrix

$$\begin{pmatrix} 2 & 3 & 0 \\ 1 & 1 & \tfrac{5}{3} \\ 1 & -1 & \tfrac{7}{3} \end{pmatrix} \tag{10.4-18}$$

Then premultiplying by the inverses of the elementary matrices used to derive (10.4-18),
we obtain {38}

$$B = \begin{pmatrix} 2 & 3 & 0 \\ \tfrac{1}{3} & -\tfrac{5}{3} & \tfrac{11}{9} \\ 1 & 1 & \tfrac{5}{3} \end{pmatrix}$$

To calculate an eigenvalue of B let us take as initial approximations $\lambda = 0$, $\lambda = -\tfrac{1}{2}$,
and $\lambda = \tfrac{1}{2}$. Then using Eqs. (10.4-10) we calculate $F(0) = -\tfrac{18}{11}$, $F(-\tfrac{1}{2}) = -\tfrac{189}{88}$, and
$F(\tfrac{1}{2}) = -\tfrac{75}{88}$. Then using Muller’s method, we obtain as the next approximation
$\lambda = .912537$. Convergence to $\lambda_1 = 1$ is very rapid (five iterations).

Now suppose we have found $\lambda_1 = 1$. Then, to calculate the next eigenvalue of B, we
use the same initial approximations, and, using (10.4-15), we calculate $F_1(0) = \tfrac{18}{11}$,
$F_1(-\tfrac{1}{2}) = \tfrac{63}{44}$, and $F_1(\tfrac{1}{2}) = \tfrac{75}{44}$. Using Muller’s method again, we converge to $\lambda_2 = -2$ in
one iteration. Similarly, we converge to $\lambda_3 = 3$ in one iteration.
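
The function values quoted in this example can be reproduced with a few lines of code. The sketch below is illustrative only: the names f_value and deflated_f are hypothetical, the matrix is assumed to be in lower Hessenberg form with no zero $b_{i,i+1}$, and exact rational arithmetic is used so that the fractions can be compared directly. It solves the first n - 1 equations of (10.4-10) with $x_1 = 1$ and returns the left-hand side of the last equation, optionally divided as in (10.4-15).

```python
from fractions import Fraction


def f_value(b, lam):
    """F(lam): left-hand side of the last equation of (10.4-10), x_1 = 1."""
    n = len(b)
    x = [1]
    for i in range(n - 1):
        # Equation i + 1 contains x_1, ..., x_{i+2}; solve it for x_{i+2}.
        partial = sum(b[i][j] * x[j] for j in range(i + 1)) - lam * x[i]
        x.append(-partial / b[i][i + 1])
    return sum(b[n - 1][j] * x[j] for j in range(n)) - lam * x[n - 1]


def deflated_f(b, lam, found):
    """F_r(lam) of (10.4-15) for the eigenvalues already listed in found."""
    value = f_value(b, lam)
    for root in found:
        value = value / (lam - root)
    return value


# The matrix B of Example 10.8 in exact rational arithmetic.
B = [[Fraction(2), Fraction(3), Fraction(0)],
     [Fraction(1, 3), Fraction(-5, 3), Fraction(11, 9)],
     [Fraction(1), Fraction(1), Fraction(5, 3)]]
```

With this B, f_value(B, 0) is -18/11 and deflated_f(B, 0, [1]) is 18/11; a root finder such as Muller's method would call these functions at each trial value of $\lambda$.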
10.4-3 Jacobi-Type Methods


The Jacobi method for symmetric matrices was based on the result that any 
symmetric matrix can be diagonalized by an orthogonal transformation. 
Theorem 10.10 assures us that any nonsymmetric matrix can be triangu- 
larized by a unitary transformation. However, it has not been proved that 
this triangularization always can be accomplished by unitary matrices anal- 
ogous to the plane rotation matrices of Sec. 10.3-1. In fact, for certain 
procedures directly analogous to those in Sec. 10.3-1, examples can be given 
for which the process will not converge. Nevertheless, by successively annihi- 
lating the largest off-diagonal element in one triangle of the matrix or by 
using the threshold technique it has been found possible in many cases to 
triangularize A and thus to find its eigenvalues and eigenvectors {41}. 

A more sophisticated and theoretically sound Jacobi-type method is the 
norm-reducing method of Eberlein. To explain the basic idea of the method, 
we need the concept of a normal matrix, which is a matrix unitarily similar to 
a real or complex diagonal matrix. It follows that a matrix A is normal if and 
only if AA* = A*A {42}. Thus, the class of normal matrices includes all 
unitary and Hermitian matrices. A unitary extension of the Jacobi method 
exists which, when applied to a normal matrix, converges to a diagonal 
matrix. 

Now, there is a theorem which states that for any matrix A, there exists a
nonsingular matrix P such that $P^{-1}AP$ is arbitrarily close to a normal
matrix and

$$\inf_P \|P^{-1}AP\|_E^2 = \sum_{i=1}^{n} |\lambda_i|^2 \tag{10.4-19}$$

Eberlein’s method in the complex form, which is more efficient than the real
form when there are complex eigenvalues, constructs a sequence of matrices
$C_i$, where

$$C_{i+1} = W_i^{-1} C_i W_i \tag{10.4-20}$$

and the $W_i$ are two-dimensional transformations, depending on a pair of
indices, (k, l), of the form RS. R is a complex rotation defined by

$$r_{kk} = r_{ll} = \cos x \qquad r_{kl} = -r_{lk}^* = e^{i\alpha} \sin x \qquad r_{ij} = \delta_{ij} \text{ otherwise} \tag{10.4-21}$$

and S is a complex shear defined by

$$s_{kk} = s_{ll} = \cosh y \qquad s_{kl} = s_{lk}^* = -ie^{i\beta} \sinh y \qquad s_{ij} = \delta_{ij} \text{ otherwise} \tag{10.4-22}$$


The index pairs are usually chosen in serial order. The parameters x, $\alpha$ of the
matrix R are chosen so that if $C_i$ is Hermitian, they reduce to the ordinary
unitary Jacobi parameters. The parameters y, $\beta$ of S are chosen to reduce the
Euclidean norm of $C_{i+1}$ in case the (k, l) element of $C_i$ deviates from
normality, i.e., if $\gamma_{kl} = (C_i^* C_i - C_i C_i^*)_{kl} \ne 0$. In this case, we have that

$$\|C_i\|_E^2 - \|C_{i+1}\|_E^2 \ge \frac{1}{3} \frac{|\gamma_{kl}|^2}{\|C_i\|_E^2} \tag{10.4-23}$$

If $\gamma_{kl} = 0$, then $\beta = y = 0$, S = I, and $\|C_{i+1}\|_E = \|C_i\|_E$ and $W_i$ reduces to a
complex rotation. Thus, as long as the matrix $C_i$ is not normal, some
succeeding matrix $C_{i+j}$ is reduced in norm, and as soon as a $C_i$ becomes normal
to machine accuracy, the above-mentioned extension of the Jacobi method
can be applied. The end result is a complex matrix $W = W_1 W_2 \cdots W_N$ such
that the off-diagonal elements of

$$\tilde{C} = W^{-1}CW \tag{10.4-24}$$

are arbitrarily small. The diagonal elements of $\tilde{C}$ are approximations to the
eigenvalues of C, and the columns of W are approximations to the
corresponding eigenvectors. The details will be found in the references given in
the Bibliographic Notes.


10.5 THE LR AND QR ALGORITHMS 


The basis of these methods is the successive factorization of a sequence of
matrices $\{A_k\}$, all of which have the same form as the original matrix $A_1 = A$;
for example, if A is tridiagonal, so is every $A_k$. The key to these methods is the
observation that if $A_1$ is factored into the product $F_1 G_1$, where $F_1$ is
nonsingular, then if we multiply $F_1$ and $G_1$ in reverse order, the matrix
$A_2 = G_1 F_1$ has the same eigenvalues as $A_1$. This is true because

$$A_2 = G_1 F_1 = F_1^{-1} A_1 F_1 \tag{10.5-1}$$

so that $A_1$ and $A_2$ are similar. Now $A_2$ itself can likewise be decomposed into
$A_2 = F_2 G_2$, and in this way we define a sequence of matrices

$$A_k = F_k G_k = G_{k-1} F_{k-1} \qquad k = 2, 3, \ldots \qquad A_1 = A = F_1 G_1 \tag{10.5-2}$$


The following properties of the matrices $A_k$, $F_k$, and $G_k$ are of interest to
us:


1. All the matrices $A_k$ are similar and therefore have the same eigenvalues.
This follows from (10.5-1).

2. Let $E_k = F_1 \cdots F_k$, so that $E_k$ is nonsingular. Since, as in (10.5-1),
$A_{k+1} = F_k^{-1} A_k F_k$, it follows inductively that

$$A_{k+1} = E_k^{-1} A_1 E_k \tag{10.5-3}$$

3. Let $H_k = G_k \cdots G_1$. Since $F_j G_j = G_{j-1} F_{j-1}$,

$$\begin{aligned}
E_k H_k &= F_1 \cdots F_{k-1} F_k G_k G_{k-1} \cdots G_1 \\
&= F_1 \cdots F_{k-1} G_{k-1} F_{k-1} G_{k-1} \cdots G_1 = E_{k-1} A_k H_{k-1}
\end{aligned}$$

Hence, using (10.5-3) with k replaced by k - 1, $E_k H_k = A_1 E_{k-1} H_{k-1}$.
By repetition of this process we arrive at

$$E_k H_k = A_1^k \tag{10.5-4}$$


The LR and QR algorithms come from the following two factorizations 
of A. 


LR If we assume that there exists a unique triangular decomposition for
each $A_k$, $A_k = L_k U_k$,† with $L_k$ unit lower triangular and hence nonsingular,
then $F_k = L_k$, $G_k = U_k$ defines the LR transformation. Such a decomposition
may not exist even if $A_k$ is nonsingular. In this case, all we know
is that for some permutation matrix $P$, $PA_k = L_k U_k$. Hence, the above
assumption is nontrivial.
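
The LR transformation and the identity (10.5-4) can be checked numerically. The following is only a minimal sketch under our own assumptions: the helper names `matmul`, `lu_no_pivot`, and `lr_iterate` are hypothetical, the factorization is a plain Doolittle elimination without pivoting, and the 2 × 2 test matrix (eigenvalues 5 and 2) is our choice, not one taken from the text.

```python
def matmul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def lu_no_pivot(A):
    """Doolittle factorization A = L U, L unit lower triangular, no pivoting."""
    n = len(A)
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    U = [[float(x) for x in row] for row in A]
    for k in range(n - 1):
        if U[k][k] == 0.0:
            # This is the situation in which the assumed decomposition fails.
            raise ZeroDivisionError("zero pivot: no LU factorization without pivoting")
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]
            U[i][k] = 0.0
            for j in range(k + 1, n):
                U[i][j] -= L[i][k] * U[k][j]
    return L, U


def lr_iterate(A, steps):
    """Apply A_{k+1} = U_k L_k; also accumulate E_k = L_1...L_k, H_k = U_k...U_1."""
    n = len(A)
    identity = [[float(i == j) for j in range(n)] for i in range(n)]
    Ak, E, H = [[float(x) for x in row] for row in A], identity, identity
    for _ in range(steps):
        L, U = lu_no_pivot(Ak)
        Ak = matmul(U, L)   # factors multiplied in reverse order
        E = matmul(E, L)
        H = matmul(U, H)
    return Ak, E, H


if __name__ == "__main__":
    A = [[4.0, 1.0], [2.0, 3.0]]
    Ak, E, H = lr_iterate(A, 40)
    print(Ak)  # tends to upper triangular form with the eigenvalues on the diagonal
```

For this matrix the subdiagonal element shrinks roughly like $(2/5)^k$, which is the linear convergence discussed later in this section.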


QR If we assume that it is possible to decompose an arbitrary real matrix $A$
into a product $QR$, where $Q$ is orthogonal and $R$‡ is upper triangular with
nonnegative diagonal elements, and that when $A$ is nonsingular, this
decomposition is unique, then $F_k = Q_k$, $G_k = R_k$ defines the QR transformation. In
contrast to the LR case, we now show that a QR decomposition always exists
by constructing it.

To obtain this decomposition, we apply to $A$ a sequence $\{P_i\}$ of Householder
transformations, as in Sec. 9.9, making sure at each stage that the
diagonal element becomes nonnegative. This is always possible since such
transformations are determined up to a sign. Thus, we find that


$P_{n-1} P_{n-2} \cdots P_1 A = R$ (10.5-5)

Since each $P_i$ is orthogonal, we have that $A = QR$, where

$Q = [P_{n-1} \cdots P_1]^T$ (10.5-6)


Since each $P_i$ is uniquely determined by the nonnegativity condition if the
diagonal element $a_{ii}^{(i)}$ is not 0 when $P_i$ is applied, it follows that when $A$ is
nonsingular, the decomposition is unique.

Since the product of triangular matrices is triangular and that of orthogonal
matrices is orthogonal, we have from (10.5-4) that $E_k$ is the lower triangular or
orthogonal factor of $A_1^k$. Hence the convergence of either the LR or QR
process is determined by the behavior of the sequence $\{E_k\}$, since
$A_{k+1} = E_k^{-1} A_1 E_k$.


† Its discoverer, Rutishauser (1955), used the mnemonic left-right (LR) whereas, in keeping
with the notation of Chap. 9, we use lower-upper.
‡ Francis (1961), the originator of this method, used R as a mnemonic for right triangular.


Theorem 10.15 If $\{E_k\}$ converges to a nonsingular matrix $E_\infty$ as $k \to \infty$,
and if each $G_k$ is an upper triangular matrix, then $\lim_{k \to \infty} A_k$ exists and is
an upper triangular matrix.


PROOF Since $\{E_k\}$ converges, the following limits also exist:


$\lim_{k \to \infty} F_k = \lim_{k \to \infty} E_{k-1}^{-1} E_k = I$ (10.5-7)

$G_\infty = \lim_{k \to \infty} G_k = \lim_{k \to \infty} A_{k+1} F_k^{-1} = \lim_{k \to \infty} E_k^{-1} A_1 E_{k-1} = E_\infty^{-1} A_1 E_\infty$ (10.5-8)

Furthermore $G_\infty$ is upper triangular since each $G_k$ is. Therefore,

$A_\infty = \lim_{k \to \infty} A_k = \lim_{k \to \infty} F_k G_k = G_\infty$ (10.5-9)


exists and is upper triangular, which proves the theorem.


An investigation of the convergence of the E, in general is beyond the 
scope of this book. However, we shall prove a quite general theorem for the 
QR case, since this is the case of practical importance, and quote some 
results for cases not covered by the theorem. First we make the following 
definition. 

If the sequence $\{A_k\}$ produced by either algorithm tends to upper triangular
form even though elements above the main diagonal may not converge,
we say that the algorithm converges essentially. If the sequence $\{A_k\}$
converges, we say that the algorithm converges.


Theorem 10.16 Let the real $n \times n$ matrix $A_1 = X \Lambda X^{-1}$, where
$\Lambda = \mathrm{diag}\{\lambda_1, \ldots, \lambda_n\}$. If $|\lambda_1| > |\lambda_2| > \cdots > |\lambda_n| > 0$, and if $X^{-1} = Y$
has an LU factorization $L_Y U_Y$, then the QR algorithm converges
essentially.


PROOF

$A_1^k = X \Lambda^k X^{-1} = X \Lambda^k L_Y U_Y = X (\Lambda^k L_Y \Lambda^{-k})(\Lambda^k U_Y)$ (10.5-10)

Now $\Lambda^k L_Y \Lambda^{-k} = I + B_k$ {44}, where

$(B_k)_{ij} = \begin{cases} (\lambda_i / \lambda_j)^k \, l_{ij} & i > j, \ l_{ij} \in L_Y \\ 0 & i \le j \end{cases}$ (10.5-11)

We thus have that

$A_1^k = X (I + B_k)(\Lambda^k U_Y)$ (10.5-12)


where, since $|\lambda_i / \lambda_j| < 1$, $i > j$, $B_k \to 0$ as $k \to \infty$.
Now, by our previous construction, $X$ can be decomposed into
$Q_X R_X$, where $R_X$ has positive diagonal elements. Therefore,


$A_1^k = Q_X R_X (I + B_k)(\Lambda^k U_Y) = Q_X (I + R_X B_k R_X^{-1})(R_X \Lambda^k U_Y)$ (10.5-13)


Since $B_k \to 0$, $I + R_X B_k R_X^{-1}$ will eventually become nonsingular and
hence will have a unique factorization $\tilde{Q}_k \tilde{R}_k$, where $\tilde{Q}_k \to I$, $\tilde{R}_k \to I$ as
$k \to \infty$. Thus


$A_1^k = (Q_X \tilde{Q}_k)(\tilde{R}_k R_X \Lambda^k U_Y)$ (10.5-14)


This need not be the QR factorization of $A_1^k$ since the diagonal of the
second factor need not be positive because of $\Lambda^k$ and $U_Y$. We therefore
define diagonal orthogonal matrices $D_\Lambda$ and $D_U$ such that $D_\Lambda \Lambda$ and
$D_U U_Y$ have positive diagonals. Then the matrix $D_\Lambda^k D_U \tilde{R}_k R_X \Lambda^k U_Y$ also
has a positive diagonal. Therefore, $Q_X \tilde{Q}_k D_U^{-1} D_\Lambda^{-k}$ is the orthogonal
factor of $A_1^k$. Hence, with $Q_X \tilde{Q}_k D_U^{-1} D_\Lambda^{-k}$ playing the role of $E_k$,


$A_{k+1} = D_\Lambda^k D_U \tilde{Q}_k^T Q_X^T A_1 Q_X \tilde{Q}_k D_U^{-1} D_\Lambda^{-k} \to D_\Lambda^k D_U Q_X^T A_1 Q_X D_U^{-1} D_\Lambda^{-k}$

$\quad = D_\Lambda^k (D_U R_X \Lambda R_X^{-1} D_U^{-1}) D_\Lambda^{-k}$ (10.5-15)

as $k \to \infty$ since $\tilde{Q}_k \to I$ and $A_1 = Q_X R_X \Lambda R_X^{-1} Q_X^T$. If $D_\Lambda^k$ converges, so
does $A_k$. But $D_\Lambda = \mathrm{diag}(\pm 1, \ldots, \pm 1)$, and so $D_\Lambda^k$ will not converge if
any diagonal element is negative. However, this has no effect on the
diagonal elements of $R_X \Lambda R_X^{-1}$ (nor on the magnitudes of the elements in
the upper triangle). Hence, we have essential convergence. Furthermore,
since the diagonal of $R_X \Lambda R_X^{-1}$ is that of $\Lambda$, we see that the eigenvalues appear
on the diagonal in decreasing order of magnitude.
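
The behavior described by Theorem 10.16 is easy to observe on a small case. The sketch below is an illustration under stated assumptions rather than the construction used in the text: it obtains the QR factors by Gram-Schmidt orthogonalization instead of Householder transformations (for a nonsingular matrix both give the unique factorization with positive diagonal in $R$), and the names `qr_gram_schmidt` and `qr_iterate` as well as the test matrix are ours.

```python
import math


def matmul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def qr_gram_schmidt(A):
    """Return (Q, R) with A = Q R, Q orthogonal, R upper triangular, diag(R) > 0.

    Assumes A is nonsingular; modified Gram-Schmidt is used only for brevity.
    """
    n = len(A)
    q_cols = []                      # orthonormal columns built so far
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = [A[r][j] for r in range(n)]
        for i in range(j):
            R[i][j] = sum(q_cols[i][r] * v[r] for r in range(n))
            v = [v[r] - R[i][j] * q_cols[i][r] for r in range(n)]
        R[j][j] = math.sqrt(sum(x * x for x in v))
        q_cols.append([x / R[j][j] for x in v])
    Q = [[q_cols[c][r] for c in range(n)] for r in range(n)]
    return Q, R


def qr_iterate(A, steps):
    """Unshifted QR algorithm: factor A_k = Q_k R_k, then form A_{k+1} = R_k Q_k."""
    Ak = [[float(x) for x in row] for row in A]
    for _ in range(steps):
        Q, R = qr_gram_schmidt(Ak)
        Ak = matmul(R, Q)
    return Ak


if __name__ == "__main__":
    # Eigenvalues 5 and 2; X^{-1} has an LU factorization for this matrix.
    print(qr_iterate([[4.0, 1.0], [2.0, 3.0]], 60))
```

Because both eigenvalues of this test matrix are positive, $D_\Lambda = I$ and the sequence converges outright, with the larger eigenvalue in the (1, 1) position.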


Further results on convergence which we state without proof are the 
following. 

If, in addition to the hypotheses of Theorem 10.16, $X$ has an LU
factorization, then the LR algorithm converges. If $Y$ does not have an LU
factorization, then since $Y$ is nonsingular, there exists a permutation matrix $P$ such
that $PY$ does have such a factorization. It is then not difficult to show that
the QR algorithm still converges essentially. However, in contrast to the case
of Theorem 10.16, the eigenvalues do not appear on the diagonal in
decreasing order of magnitude. In fact, the sequence $\{A_k\}$ always converges to block
triangular form, each diagonal block having roots of equal magnitude. In
general, this does not cause problems. Only when there are distinct
eigenvalues of equal modulus do we have convergence to a form which is
troublesome. Thus, even with real multiple eigenvalues, the convergence is
to triangular form, while for multiple complex-conjugate pairs of
eigenvalues, the limiting form of $A_k$ will yield these roots in a string of 2 × 2
submatrices along the diagonal. Hence, except for rare cases, the limiting
form of $A_k$ is some modification of the form


$$\begin{bmatrix}
\lambda_1 & \times & \cdots & \times & \times & \cdots & \times \\
0 & \lambda_2 & \cdots & \times & \times & \cdots & \times \\
\vdots & & \ddots & \vdots & \vdots & & \vdots \\
0 & 0 & \cdots & \lambda_m & \times & \cdots & \times \\
0 & 0 & \cdots & 0 & B_1 & \cdots & \times \\
\vdots & & & \vdots & & \ddots & \vdots \\
0 & 0 & \cdots & 0 & 0 & \cdots & B_l
\end{bmatrix}$$ (10.5-16)


where $m + 2l = n$, each $B_i$ is a 2 × 2 real submatrix with complex conjugate
eigenvalues which are eigenvalues of $A_1$, and the real eigenvalues $\lambda_i$ and the
$B_i$ may appear in any order along the diagonal. Cases in which this limiting
form is not achieved almost never arise in practice since, as we shall see
below, the QR transformations are modified by shifts, which greatly reduce
the possibility of distinct eigenvalues of equal modulus.

Returning to Theorem 10.16, we see that $a_{ii}^{(k)} \to \lambda_i$ as $k \to \infty$. In fact, it
can be shown [see Parlett (1965)] that


$a_{ii}^{(k)} = \lambda_i + O(r_i^k) \qquad a_{i,i-1}^{(k)} = O(r_i^k)$ (10.5-17)

where $r_i = \max\left( \left|\dfrac{\lambda_i}{\lambda_{i-1}}\right|, \left|\dfrac{\lambda_{i+1}}{\lambda_i}\right| \right) \qquad \lambda_0 = \infty \quad \lambda_{n+1} = 0$ (10.5-18)


which is linear convergence. In particular, $r_n = |\lambda_n / \lambda_{n-1}|$, and our strategy
is to try to modify the algorithm to make $r_n$ very small so that $a_{nn}^{(k)} \to \lambda_n$ and
$a_{n,n-1}^{(k)} \to 0$ very rapidly. Looking ahead a little, if the $A_k$ were upper
Hessenberg or tridiagonal, then once $a_{n,n-1}^{(k)}$ converged to 0 to within machine
accuracy, we would have computed $\lambda_n = a_{nn}^{(k)}$. Then, we could deflate the
matrix and work only with the matrix of order $n - 1$ consisting of the first
$n - 1$ rows and columns of $A_k$. Thus successive deflations would result in
less and less work to compute succeeding QR transformations.

Recalling our discussion of the power method (Sec. 10.2), we know that
the eigenvalues of $A_1 - pI$ are $\lambda_i - p$, $i = 1, \ldots, n$, so that if we can choose an
appropriate value of $p$, the powers of the ratio $|(\lambda_n - p)/(\lambda_{n-1} - p)|$ will
converge to 0 very rapidly. A good estimate of $\lambda_n$ would serve this purpose very
well. In order to allow us to choose the best estimate at each iteration, we must
work with $A_k - p_k I$ involving the variable shift $p_k$. This gives us the modified
algorithm


Factor $A_k - p_k I$ into $F_k G_k$
Form $A_{k+1} = G_k F_k + p_k I$ $\qquad k = 1, 2, \ldots$ (10.5-19)

Then, as before {45},


$A_{k+1} = F_k^{-1} \cdots F_1^{-1} A_1 F_1 \cdots F_k = E_k^{-1} A_1 E_k$ (10.5-20)

$E_k H_k = \prod_{j=1}^{k} (A_1 - p_j I)$ (10.5-21)


If we define $\phi_k(\lambda) = \prod_{j=1}^{k} (\lambda - p_j)$, the proof of Theorem 10.16 will carry
over to the modified algorithm when $\lambda_i^k$ is replaced by $\phi_k(\lambda_i)$. Thus, provided
the $p_j$ are chosen so that $|\phi_k(\lambda_i)| \ne |\phi_k(\lambda_s)|$, $i \ne s$, for $k > K$ and so that
$\phi_k(\lambda_i) \ne 0$, then the modified algorithm will converge. We defer until later
our discussion of the choice of $p_j$.


10.5-1 The Simple QR Algorithm 


As indicated above, we shall restrict our attention to the QR algorithm. We 
do this because the LR algorithm is numerically unstable. This is clearly true 
for the general situation since the LU decomposition of a matrix without 
pivoting may lead to disaster. While it is possible to modify this algorithm to 
include pivoting, the theoretical basis of the convergence proof is lost, and, 
in fact, simple examples can be constructed for which convergence does not 
take place. Even when the LU decomposition without pivoting can be 
justified, as in the case of positive definite symmetric matrices, in practice, 
there is a loss of accuracy. Hence, even though the QR algorithm is more 
expensive than the LR algorithm, its superior numerical properties more 
than make up for the extra labor. 

If we count the number of operations involved in a single FG
transformation applied to a full matrix, we find that it is of the order of $n^3$. Since
many iterations may be necessary before convergence, this could be quite an
expensive task. Fortunately, the workload decreases substantially if we work
with Hessenberg† or tridiagonal matrices, which we can do inasmuch as the
QR transformation leaves these forms invariant {46}. The work involved in
a single transformation is of the order of $n^2$ operations for Hessenberg
matrices and of $n$ operations for tridiagonal matrices. We shall henceforth
restrict our discussion to such matrices, where we further assume that all
elements on the subdiagonal are nonzero, since otherwise we could
decompose our problem into smaller ones {46}.

We now distinguish between two cases. In this section, we shall deal 
with the case where we know that all the eigenvalues are real. In Sec. 10.5-2 
we deal with the case where some of the eigenvalues may be complex. In the 
first case there is a very efficient scheme for implementing the QR 
transformation. 

Since $A_k$ is now assumed to be of Hessenberg or tridiagonal form, we
need not use Householder transformations to transform $A_k$ to upper triangular
form but can use plane rotations. Recalling the notation of Sec. 10.3-2,
we apply the sequence of transformations $(i, i+1, i)$, $i = 1, \ldots, n - 1$, to
$A_k - p_k I$, where premultiplication by the corresponding matrix $S_{i,i+1,i}^T$
annihilates the element in the $i$th column and $(i + 1)$st row. We thus have


$S_{n-1,n,n-1}^T \cdots S_{2,3,2}^T S_{1,2,1}^T (A_k - p_k I) = R_k$ (10.5-22)

so that $Q_k = S_{1,2,1} \cdots S_{n-1,n,n-1}$, and

$A_{k+1} = R_k S_{1,2,1} \cdots S_{n-1,n,n-1} + p_k I$ (10.5-23)


Each such transformation takes about $4n^2$ multiplications and $n - 1$ square
roots {46}. The shift $p_k$ is determined from the eigenvalues $\mu_k$ and $\nu_k$ of $T_k$, the
bottom 2 × 2 submatrix of $A_k$. If both are real, we take $p_k$ to be $\mu_k$ or $\nu_k$
according as $|\mu_k - a_{nn}^{(k)}|$ or $|\nu_k - a_{nn}^{(k)}|$ is smaller. Otherwise we set
$p_k = \mathrm{Re}\, \mu_k$.


† In this section, Hessenberg will always mean upper Hessenberg.

If our matrix $A_1$ is symmetric and tridiagonal, then since the QR
transformation preserves symmetry {43}, all subsequent matrices $A_k$ will be
symmetric and hence tridiagonal (why?). In this case, the QR algorithm with
shifts is very efficient. Thus the combined algorithm of first reducing a
symmetric matrix to tridiagonal form by Householder transformations and then
applying the QR algorithm is probably the most effective way to evaluate all
the eigenvalues of a symmetric matrix.


Example 10.9 Apply the simple QR algorithm to find all the (real) eigenvalues of the
symmetric tridiagonal matrix $A$, where


$$A = A_1 = \begin{bmatrix} 120.0 & -90.86 & 0 & 0 \\ -90.86 & 157.2 & 67.59 & 0 \\ 0 & 67.59 & 124.0 & -46.26 \\ 0 & 0 & -46.26 & 78.84 \end{bmatrix}$$


The computations below were carried out to about 14 significant figures. We give the
results rounded to five figures. We give the transformation from $A_1$ to $A_2$ in full detail.
After that, we shall only give the shifts $p_k$ and the matrices $A_k$.


$$p_1 = 49.943 \qquad S_{121}^T = \begin{bmatrix} .61061 & -.79193 & 0 & 0 \\ .79193 & .61061 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

$$S_{121}^T (A_1 - p_1 I) = \begin{bmatrix} 114.73 & -140.42 & -53.527 & 0 \\ 0 & -6.4629 & 41.271 & 0 \\ 0 & 67.590 & 74.057 & -46.260 \\ 0 & 0 & -46.260 & 28.897 \end{bmatrix}$$

$$S_{232}^T = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & -.095185 & .99546 & 0 \\ 0 & -.99546 & -.095185 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

$$S_{232}^T S_{121}^T (A_1 - p_1 I) = \begin{bmatrix} 114.73 & -140.42 & -53.527 & 0 \\ 0 & 67.898 & 69.792 & -46.050 \\ 0 & 0 & -48.133 & 4.4032 \\ 0 & 0 & -46.260 & 28.897 \end{bmatrix}$$

$$S_{343}^T = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & -.72099 & -.69294 \\ 0 & 0 & .69294 & -.72099 \end{bmatrix}$$

$$R_1 = S_{343}^T S_{232}^T S_{121}^T (A_1 - p_1 I) = \begin{bmatrix} 114.73 & -140.42 & -53.527 & 0 \\ 0 & 67.898 & 69.792 & -46.050 \\ 0 & 0 & 66.759 & -23.198 \\ 0 & 0 & 0 & -17.783 \end{bmatrix}$$

$$R_1 S_{121} = \begin{bmatrix} 181.26 & 5.1182 & -53.527 & 0 \\ -53.771 & 41.459 & 69.792 & -46.050 \\ 0 & 0 & 66.759 & -23.198 \\ 0 & 0 & 0 & -17.783 \end{bmatrix}$$

$$R_1 S_{121} S_{232} = \begin{bmatrix} 181.26 & -53.771 & 0 & 0 \\ -53.771 & 65.529 & -47.914 & -46.050 \\ 0 & 66.456 & -6.3544 & -23.198 \\ 0 & 0 & 0 & -17.783 \end{bmatrix}$$

$$A_2 - p_1 I = R_1 S_{121} S_{232} S_{343} = \begin{bmatrix} 181.26 & -53.771 & 0 & 0 \\ -53.771 & 65.529 & 66.456 & 0 \\ 0 & 66.456 & 20.657 & 12.323 \\ 0 & 0 & 12.323 & 12.822 \end{bmatrix}$$

$$A_2 = \begin{bmatrix} 231.20 & -53.771 & 0 & 0 \\ -53.771 & 115.47 & 66.456 & 0 \\ 0 & 66.456 & 70.600 & 12.323 \\ 0 & 0 & 12.323 & 62.765 \end{bmatrix}$$

$$p_2 = 53.752 \qquad A_3 = \begin{bmatrix} 251.32 & -23.029 & 0 & 0 \\ -23.029 & 136.29 & 38.237 & 0 \\ 0 & 38.237 & 28.568 & -2.8312 \\ 0 & 0 & -2.8312 & 63.861 \end{bmatrix}$$

$$p_3 = 64.087 \qquad A_4 = \begin{bmatrix} 255.18 & -9.6151 & 0 & 0 \\ -9.6151 & 140.163 & 24.060 & 0 \\ 0 & 24.060 & 20.690 & -.0047934 \\ 0 & 0 & -.0047934 & 64.003 \end{bmatrix}$$

$$p_4 = 64.003 \qquad A_5 = \begin{bmatrix} 255.86 & -3.9842 & 0 & 0 \\ -3.9842 & 142.45 & 14.730 & 0 \\ 0 & 14.730 & 17.730 & <10^{-10} \\ 0 & 0 & <10^{-10} & 64.003 \end{bmatrix}$$


We now deflate and continue on the leading 3 × 3 submatrix of $A_5$.


$$p_5 = 16.014 \qquad A_6 = \begin{bmatrix} 255.96 & -2.1128 & 0 \\ -2.1128 & 144.07 & -.00010338 \\ 0 & -.00010338 & 16.013 \end{bmatrix}$$

$$p_6 = 16.013 \qquad A_7 = \begin{bmatrix} 255.99 & -1.1273 & 0 \\ -1.1273 & 144.04 & <10^{-10} \\ 0 & <10^{-10} & 16.013 \end{bmatrix}$$


We now deflate again and continue on the leading 2 × 2 submatrix of $A_7$.


$$p_7 = 144.03 \qquad A_8 = \begin{bmatrix} 256.00 & <10^{-10} \\ <10^{-10} & 144.03 \end{bmatrix}$$


We are now through, and the eigenvalues, read off the diagonal of the full matrix $A_8$, are

256.00 144.03 16.013 64.003


Note that they are not in decreasing order of magnitude.
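
The computation in Example 10.9 can be retraced with a short program. What follows is a sketch under our own assumptions, not the organization recommended in the text: the matrix is stored in full rather than in band form, the rotation values are saved in a list, the deflation test is the relative test (10.5-42) with a tolerance of our choosing, and the names `bottom_shift`, `qr_step`, and `eigenvalues_simple_qr` are hypothetical.

```python
import math


def bottom_shift(A, n):
    """Eigenvalue of the trailing 2 x 2 block of the leading n x n part of A
    that is closer to a_nn; the real part if the pair is complex."""
    a, b = A[n - 2][n - 2], A[n - 2][n - 1]
    c, d = A[n - 1][n - 2], A[n - 1][n - 1]
    mean = 0.5 * (a + d)
    disc = (0.5 * (a - d)) ** 2 + b * c
    if disc < 0.0:
        return mean
    root = math.sqrt(disc)
    mu, nu = mean + root, mean - root
    return mu if abs(mu - d) < abs(nu - d) else nu


def qr_step(A, n, p):
    """One shifted QR step, (10.5-22) and (10.5-23), on the leading n x n
    Hessenberg block of A, carried out in place with plane rotations."""
    for i in range(n):
        A[i][i] -= p
    rotations = []
    for i in range(n - 1):              # premultiply: annihilate A[i+1][i]
        x, y = A[i][i], A[i + 1][i]
        r = math.hypot(x, y)
        c, s = (1.0, 0.0) if r == 0.0 else (x / r, y / r)
        rotations.append((c, s))
        for j in range(n):
            u, v = A[i][j], A[i + 1][j]
            A[i][j], A[i + 1][j] = c * u + s * v, -s * u + c * v
    for i, (c, s) in enumerate(rotations):   # postmultiply by the transposes
        for j in range(n):
            u, v = A[j][i], A[j][i + 1]
            A[j][i], A[j][i + 1] = c * u + s * v, -s * u + c * v
    for i in range(n):
        A[i][i] += p


def eigenvalues_simple_qr(A, tol=1e-12, max_steps=200):
    """Real eigenvalues of a Hessenberg matrix, in the order they are deflated."""
    A = [[float(x) for x in row] for row in A]
    n, steps, found = len(A), 0, []
    while n > 1:
        small = tol * (abs(A[n - 1][n - 1]) + abs(A[n - 2][n - 2]))
        if abs(A[n - 1][n - 2]) <= small:
            found.append(A[n - 1][n - 1])    # accept a_nn and deflate
            n -= 1
            continue
        if steps >= max_steps:
            raise RuntimeError("no convergence within max_steps")
        qr_step(A, n, bottom_shift(A, n))
        steps += 1
    found.append(A[0][0])
    return found


if __name__ == "__main__":
    A1 = [[120.0, -90.86, 0.0, 0.0],
          [-90.86, 157.2, 67.59, 0.0],
          [0.0, 67.59, 124.0, -46.26],
          [0.0, 0.0, -46.26, 78.84]]
    print(eigenvalues_simple_qr(A1))
```

Since the first shift computed by `bottom_shift` is the $p_1$ of the example, the first pass through `qr_step` should retrace the transformation from $A_1$ to $A_2$ given above; later passes may differ slightly from the example in when deflation is declared, because the tolerance is ours.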


We see that each premultiplication by $S_{i,i+1,i}^T$ affects only rows $i$ and
$i + 1$ and, similarly, each postmultiplication by $S_{j,j+1,j}$ affects only columns
$j$ and $j + 1$. From this we see that if $A$ is tridiagonal, then $R$ is a band matrix
of bandwidth [0, 2]. This implies that the storage requirements in this case
are $O(n)$ rather than $O(n^2)$ and similarly for the number of operations per
iteration. We also notice that each $A_k$ is tridiagonal and symmetric, as
expected.

We also notice that at the same time that the element in the lower
right-hand corner of the matrix is converging to an eigenvalue, the other
diagonal elements are also converging, although at a slower rate. Similarly,
the other off-diagonal elements are tending to 0 although not as fast as
$a_{n,n-1}$. Thus, the later deflated matrices converge in
fewer iterations than are required for the earlier ones.

We point out one final item connected with the organization of the
computation. We have applied formulas (10.5-22) and (10.5-23) in a
straightforward fashion here. This requires saving the matrices $S_{i,i+1,i}$, $i = 1, \ldots,
n - 1$, or more accurately, the two values $\sin \theta_i$ and $\cos \theta_i$ which determine
$S_{i,i+1,i}$. However, we can reorganize the computation to avoid saving these
values by noticing that, for example, after computing $S_{232}^T S_{121}^T (A_1 - p_1 I)$,
the first two columns remain unchanged and no information contained in
them is needed for the rest of the computation of $R_1$. Hence, we can compute
$S_{232}^T S_{121}^T (A_1 - p_1 I) S_{121}$ before premultiplying by $S_{343}^T$, and therefore we
need not save $S_{121}$ any more. In general, we postmultiply by $S_{i-1,i,i-1}$ after
the premultiplication by $S_{i,i+1,i}^T$, $i = 2, \ldots, n - 1$, and finally, postmultiply
by $S_{n-1,n,n-1}$. This saves about $2n$ storage locations.

Further savings in computation and storage can be obtained in the
symmetric tridiagonal case if we take into account the fact that each $A_k$ is of
the same form. We leave the details as a problem {47}.


10.5-2 The Double QR Algorithm 


We now discuss the more general situation in which the matrix $A$, which we
assume to be Hessenberg, may have complex roots which occur in pairs as
conjugate-complex numbers. In this case it is important for rapid
convergence of the QR algorithm that shifts be made with complex numbers. For
example, if $\lambda_n = \bar{\lambda}_{n-1} = a + ib$ and $\lambda_{n-2} = \bar{\lambda}_{n-3} = a + i(b + \epsilon)$, $\epsilon > 0$, then
the real number $p$ which minimizes

$r^2 = \left| \dfrac{\lambda_n - p}{\lambda_{n-2} - p} \right|^2$

is $p = a$, and the corresponding value of $r$ which gives the best rate of
convergence is $|b/(b + \epsilon)|$, which is close to 1 for small $\epsilon$. This shows that
the convergence of the QR algorithm may be very slow if we are restricted to
real shifts. Thus, assuming that $\lambda_n = \bar{\lambda}_{n-1}$ is the eigenvalue of smallest
magnitude (we shall see below that the procedure to be described works even if
$\lambda_n$ is real), we wish to drive the element $a_{n-1,n-2}$ in position $(n - 1, n - 2)$ to
zero leaving a 2 × 2 matrix at the bottom of $A_k$ whose eigenvalues are $\lambda_n$ and
$\bar{\lambda}_n$. Since $a_{n-1,n-2}^{(k)}$ goes to 0 as $|\lambda_n/\lambda_{n-2}|^k$, a shift by an approximation to $\lambda_n$
is called for. However, we would prefer not to leave the real field, if possible.
This can be accomplished if we follow a shift with a complex $p$ by a shift with
$\bar{p}$. Thus if we perform the following pair of QR transformations


$A_1 - pI = Q_1 R_1 \qquad R_1 Q_1 + pI = A_2$

$A_2 - \bar{p}I = Q_2 R_2 \qquad R_2 Q_2 + \bar{p}I = A_3$ (10.5-24)

it can be shown {50} that

$A_3 = (Q_1 Q_2)^* A_1 (Q_1 Q_2) = Q^* A_1 Q$ (10.5-25)

and $(Q_1 Q_2)(R_2 R_1) = (A_1 - pI)(A_1 - \bar{p}I) = B(p)$ (10.5-26)


The matrices $Q_1$, $Q_2$, $R_1$, $R_2$ will be complex, but $Q = Q_1 Q_2$ and $R = R_2 R_1$ are
real since they correspond to the factorization of the real matrix $B(p)$, so that
$A_3$ is also real.

The double QR algorithm is a method for determining $Q$ and $A_3$ without
computing $Q_1$, $R_1$, $Q_2$, $R_2$, or $A_2$. It is based on the following theorem
about Hessenberg matrices.


Theorem 10.17 Let $H = Q^T A Q$, where $A$ is any matrix, $Q$ is orthogonal,
and $H$ is a Hessenberg matrix which is known to have nonvanishing
subdiagonal elements. Then given $A$ and the first column of $Q$, $H$ and
the remaining columns of $Q$ are determined.


PROOF We shall proceed in an inductive manner to generate $H$ and the
remaining columns of $Q$. Denote by $h_j = (h_{1j}, h_{2j}, \ldots, h_{j+1,j}, 0, \ldots, 0)^T$
and by $q_j$ the $j$th column of $H$ and $Q$, respectively. Equating the first
columns of $QH$ and $AQ$, we have that


$h_{11} q_1 + h_{21} q_2 = A q_1$ (10.5-27)

Hence, since $Q$ is orthogonal and $q_1$ is known, we can determine

$h_{11} = q_1^T A q_1$ (10.5-28)

Once $h_{11}$ is known, we have

$h_{21} q_2 = A q_1 - h_{11} q_1 = \tilde{q}_2$ (10.5-29)

Since $Q$ is orthogonal, $\|q_2\| = 1$. Hence we have

$h_{21} = \|\tilde{q}_2\| \qquad q_2 = \dfrac{\tilde{q}_2}{h_{21}}$ (10.5-30)


Thus we have determined $h_1$ and $q_2$.
Assume now that we have determined $q_1, \ldots, q_j$, $j < n$, and $h_1, \ldots,
h_{j-1}$. We determine $h_j$ and $q_{j+1}$ as follows. Since


$\sum_{i=1}^{j+1} h_{ij} q_i = A q_j$ (10.5-31)

we find as before that

$h_{ij} = q_i^T A q_j \qquad i = 1, \ldots, j$ (10.5-32)

Furthermore $h_{j+1,j} q_{j+1} = A q_j - \sum_{i=1}^{j} h_{ij} q_i = \tilde{q}_{j+1}$ (10.5-33)

from which $h_{j+1,j} = \|\tilde{q}_{j+1}\| \qquad q_{j+1} = \dfrac{\tilde{q}_{j+1}}{h_{j+1,j}}$ (10.5-34)

Once we have determined $q_n$, we compute $h_{in}$, $i = 1, \ldots, n$, using
(10.5-32). The only point at which this procedure can break down is if
some $h_{j+1,j}$ vanishes. However, our hypothesis on $H$ precludes this, and
the proof is complete.
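
The inductive construction in this proof translates almost line for line into a program. The sketch below is illustrative only; the function name `hessenberg_from_first_column` and the small test matrix are our own, the columns $q_j$ are stored as a list of vectors, and no attempt is made at the numerical safeguards a production routine would need.

```python
import math


def hessenberg_from_first_column(A, q1):
    """Build H (upper Hessenberg) and the columns q_1, ..., q_n of Q from A and
    a unit vector q1, following (10.5-27) through (10.5-34)."""
    n = len(A)
    q = [[float(x) for x in q1]]            # q[j] is the (j+1)st column of Q
    H = [[0.0] * n for _ in range(n)]
    for j in range(n):
        w = [sum(A[r][c] * q[j][c] for c in range(n)) for r in range(n)]  # A q_j
        for i in range(j + 1):
            H[i][j] = sum(q[i][r] * w[r] for r in range(n))   # (10.5-32)
        if j == n - 1:
            break
        for i in range(j + 1):                                 # (10.5-33)
            for r in range(n):
                w[r] -= H[i][j] * q[i][r]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            # The only possible breakdown: a vanishing subdiagonal element.
            raise ZeroDivisionError("subdiagonal element of H vanishes")
        H[j + 1][j] = norm                                     # (10.5-34)
        q.append([x / norm for x in w])
    return H, q


if __name__ == "__main__":
    A = [[4.0, 1.0, 2.0], [3.0, 5.0, 1.0], [1.0, 2.0, 6.0]]
    H, q = hessenberg_from_first_column(A, [1.0, 0.0, 0.0])
    for row in H:
        print(row)
```

As the text goes on to explain, the double QR algorithm does not use this construction directly but relies on the uniqueness it establishes.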


This theorem implies that if we can find the first column of $Q$ in
(10.5-25), then we can find $Q$ and $H$ such that


$A_1 Q = QH$ (10.5-35)


with $H = A_3$ as long as no subdiagonal elements of $A_3$ vanish. Now, if $p$ is
not an eigenvalue of $A_1$, it can be shown that if $A_1$ has nonvanishing
subdiagonal elements, so does $A_3$. On the other hand, if $p$ happens to be an
eigenvalue of $A_1$, then since, as we shall see below, we do not compute $A_3$
using the algorithm indicated by the proof of Theorem 10.17 but a more
efficient algorithm based on matrix transformations, we shall find that the
last two rows of $A_3$ will be in block triangular form, thus revealing the
eigenvalues.

The double QR transformation derives a Hessenberg matrix $H$ from $A_1$,
which guarantees that $H = A_3$. From (10.5-26) the correct $Q$ is that
orthogonal matrix for which


$QR = B(p)$ or $R = Q^T B(p)$ (10.5-36)

We are interested in the first column of $Q$ or the first row of $Q^T$. $Q^T$ is a
matrix which converts $B(p)$ to upper triangular form, i.e., to $R$. Hence it can
be derived as a product $P_{n-1} \cdots P_1$ of Householder transformations as
shown in Sec. 9.9. Now, the first row of $Q^T = P_{n-1} \cdots P_1$, that is, the first
column of $Q$, is the first row of $P_1$ {51}. Hence our problem reduces to finding
the first row of $P_1$. But $P_1$ is the matrix which introduces zeros into the first
column when triangularizing the real matrix $B(p)$ and it is determined solely
by the first column of $B(p)$.

Because $B(p)$ is the product of two upper Hessenberg matrices, only the
first three elements of its first column are nonzero. They are given by


$b_{11} = (a_{11} - p)(a_{11} - \bar{p}) + a_{12} a_{21} = a_{11}^2 - a_{11}(p + \bar{p}) + p\bar{p} + a_{12} a_{21}$
$b_{21} = a_{21}[a_{11} + a_{22} - (p + \bar{p})]$ (10.5-37)
$b_{31} = a_{32} a_{21}$


Having computed $b_{11}$, $b_{21}$, and $b_{31}$ from the values of $A_1$ and the shift $p$, we
compute $P_1$ as in Sec. 9.9 by


$P_1 = I - \dfrac{2uu^T}{\|u\|_2^2}$ (10.5-38)

where

$u = [b_{11} + \mathrm{sgn}(b_{11})(b_{11}^2 + b_{21}^2 + b_{31}^2)^{1/2}, \ b_{21}, \ b_{31}, \ 0, \ldots, 0]^T$ (10.5-39)
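
Formulas (10.5-37) through (10.5-39) involve the shifts only through $p + \bar{p}$ and $p\bar{p}$, so they can be evaluated in real arithmetic. The following sketch is an illustration under our own naming (`first_column_of_B` and `householder_P1` are hypothetical helpers); it forms $P_1$ as a full matrix purely for clarity, and it uses the matrix and zero initial shifts of Example 10.10 as data.

```python
import math


def first_column_of_B(A, p_sum, p_prod):
    """(b11, b21, b31) of (10.5-37) for a Hessenberg matrix A, where
    p_sum = p + conj(p) and p_prod = p * conj(p) are both real."""
    a11, a12 = A[0][0], A[0][1]
    a21, a22 = A[1][0], A[1][1]
    a32 = A[2][1]
    b11 = a11 * a11 - a11 * p_sum + p_prod + a12 * a21
    b21 = a21 * (a11 + a22 - p_sum)
    b31 = a32 * a21
    return b11, b21, b31


def householder_P1(b, n):
    """Vector u of (10.5-39) and the n x n matrix P1 of (10.5-38)."""
    s = math.sqrt(sum(x * x for x in b))
    u = [0.0] * n
    u[0] = b[0] + math.copysign(s, b[0])     # b11 + sgn(b11) * norm
    u[1], u[2] = b[1], b[2]
    uu = sum(x * x for x in u)
    P1 = [[float(i == j) - 2.0 * u[i] * u[j] / uu for j in range(n)]
          for i in range(n)]
    return u, P1


if __name__ == "__main__":
    A1 = [[5.0, -2.0, -5.0, -1.0],
          [1.0, 0.0, -3.0, 2.0],
          [0.0, 2.0, 2.0, -3.0],
          [0.0, 0.0, 1.0, -2.0]]
    b = first_column_of_B(A1, 0.0, 0.0)
    u, P1 = householder_P1(b, 4)
    print(b, u)
```

Only the leading 3 × 3 block of the resulting $P_1$ differs from the identity, which is what makes the subsequent reduction to Hessenberg form cheap.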


Consider now the matrix $P_1 A_1 P_1$ (recall that $P_1$ is orthogonal and
symmetric). It can be reduced to Hessenberg form by a series of orthogonal
similarity transformations using Householder transformations as in
Sec. 10.4. Thus we have that


$T P_1 A_1 P_1 T^T = H$ (10.5-40)


where $T$ is the product of Householder matrices and $H$ is Hessenberg. Now
the first row of $TP_1$ is the first row of $P_1$ {51}, which is itself the first row of
$Q^T$. Therefore, $P_1 T^T$ is the $Q$ of Theorem 10.17. Thus, by (10.5-25), $H$ is the
matrix $A_3$ obtained by the two steps of the QR algorithm (10.5-24).

Now, while the double QR algorithm was originally derived to do two
QR transformations with conjugate-complex shifts, it works just as well with
two real shifts $p_1$ and $p_2$, where we replace $p + \bar{p}$ by $p_1 + p_2$ and $p\bar{p}$ by $p_1 p_2$
in (10.5-37). We shall therefore refer to a pair of shifts $(p_1, p_2)$, where $p_1$ and
$p_2$ are either both real or complex conjugates. One recommended strategy
for choosing them is as follows. At each stage $\mu_k$, $\nu_k$ are eigenvalues of $T_k$, the
lower 2 × 2 matrix of $A_k$. Initially $p_1^{(1)} = p_2^{(1)} = 0$. If $|\mu_k - \mu_{k-2}| > \frac{1}{2}|\mu_k|$
and $|\nu_k - \nu_{k-2}| > \frac{1}{2}|\nu_k|$, then $p_1^{(k)} = p_1^{(k-2)}$, $p_2^{(k)} = p_2^{(k-2)}$. If
$|\mu_k - \mu_{k-2}| < \frac{1}{2}|\mu_k|$ and $|\nu_k - \nu_{k-2}| < \frac{1}{2}|\nu_k|$, then $p_1^{(k)} = \mu_k$, $p_2^{(k)} = \nu_k$.
Otherwise, if $|\mu_k - \mu_{k-2}| < \frac{1}{2}|\mu_k|$, $p_1^{(k)} = p_2^{(k)} = \mathrm{Re}\, \mu_k$, while if
$|\nu_k - \nu_{k-2}| < \frac{1}{2}|\nu_k|$, $p_1^{(k)} = p_2^{(k)} = \mathrm{Re}\, \nu_k$. The reasoning behind this strategy
is as follows. If both eigenvalues of $T_k$ have settled down to the extent that
the magnitude of the ratio of the difference between an eigenvalue of $T_k$ and
the corresponding eigenvalue of $T_{k-2}$ to the eigenvalue of $T_k$ is less than $\frac{1}{2}$,
then we use them as shifts. If neither eigenvalue has settled down, we use the
previous shifts. If only one of the eigenvalues has settled down, we assume
that it indicates the presence of a real eigenvalue and use a real shift.
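
The shift strategy just described amounts to a small decision procedure. The sketch below is one possible reading of it, with names of our own invention (`choose_shifts`); in particular it assumes the caller has already paired each current eigenvalue with the corresponding earlier one, which the text does not spell out. The sample values in the usage lines are those listed in Example 10.10.

```python
def choose_shifts(mu, nu, mu_old, nu_old, p_old):
    """Return the pair (p1, p2) for the next double step.

    mu, nu         : eigenvalues of the current bottom 2 x 2 block (float or complex)
    mu_old, nu_old : the corresponding eigenvalues from the previous double step
    p_old          : the pair of shifts used in the previous double step
    """
    mu_settled = abs(mu - mu_old) < 0.5 * abs(mu)
    nu_settled = abs(nu - nu_old) < 0.5 * abs(nu)
    if mu_settled and nu_settled:
        return (mu, nu)                 # use both eigenvalues as shifts
    if mu_settled:
        return (mu.real, mu.real)       # one real eigenvalue seems to be emerging
    if nu_settled:
        return (nu.real, nu.real)
    return p_old                        # neither has settled: keep previous shifts


if __name__ == "__main__":
    print(choose_shifts(1.03720, -0.84268, 1.0, -1.0, (0.0, 0.0)))
    print(choose_shifts(-0.15091, -1.02709, 1.03720, -0.84268, (1.03720, -0.84268)))
```

Only $p_1 + p_2$ and $p_1 p_2$ are needed afterward, so when the pair is complex conjugate the computation stays in real arithmetic.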

To sum up, a complete step of the double QR algorithm is as follows:


1. Compute $p_1^{(k)} + p_2^{(k)}$, $p_1^{(k)} p_2^{(k)}$ based on the eigenvalues of $T_k$.

2. Compute $b_{11}$, $b_{21}$, and $b_{31}$ using (10.5-37) with $p_1^{(k)}$, $p_2^{(k)}$ in place of $p$, $\bar{p}$,
determine the matrix $P_1$ using (10.5-38) and (10.5-39), and compute
$P_1 A_k P_1$.

3. Reduce $P_1 A_k P_1$ to Hessenberg form using Householder similarity
transformations yielding $A_{k+2}$.


As soon as $a_{n,n-1}^{(k)}$ or $a_{n-1,n-2}^{(k)}$ is 0 to within machine accuracy, we
deflate to a matrix of order $n - 1$ or $n - 2$, respectively, and accept $a_{nn}^{(k)}$ as a
single eigenvalue or the pair of eigenvalues of $T_k$ as a pair of eigenvalues of
$A$. If, at any stage, one of the elements on the subdiagonal, say $a_{j,j-1}^{(k)}$,
becomes effectively 0, we decompose $A_k$ and continue working on the lower
matrix of order $n - j + 1$. One conservative criterion for accepting an
off-diagonal element $a_{j,j-1}^{(k)}$, where $j$ may also be $n$ or $n - 1$, as 0 is that


$|a_{j,j-1}^{(k)}| < \epsilon \min(|a_{jj}^{(k)}|, |a_{j-1,j-1}^{(k)}|)$ (10.5-41)


where $\epsilon$ is our relative accuracy tolerance. Alternatively, we may use the less
stringent test

$|a_{j,j-1}^{(k)}| < \epsilon (|a_{jj}^{(k)}| + |a_{j-1,j-1}^{(k)}|)$ (10.5-42)
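
The two acceptance tests differ only in the quantity against which the subdiagonal element is compared. A minimal sketch follows; the function names are hypothetical and indices are 0-based, so `j` refers to row `j` and the element tested is `A[j][j-1]`.

```python
def negligible_conservative(A, j, eps):
    """Test (10.5-41): compare against the smaller neighboring diagonal element."""
    return abs(A[j][j - 1]) < eps * min(abs(A[j][j]), abs(A[j - 1][j - 1]))


def negligible_relaxed(A, j, eps):
    """Test (10.5-42): compare against the sum of the neighboring diagonal elements."""
    return abs(A[j][j - 1]) < eps * (abs(A[j][j]) + abs(A[j - 1][j - 1]))


if __name__ == "__main__":
    A = [[2.0, 1.0], [1e-12, 1e-3]]
    # The diagonal elements differ greatly in size, so the two tests disagree here.
    print(negligible_conservative(A, 1, 1e-10), negligible_relaxed(A, 1, 1e-10))
```

When the neighboring diagonal elements are of comparable size the two tests behave alike; they part ways when one diagonal element is much smaller than the other, as in the usage lines above.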


Since the matrix $B(p)$ is not a full matrix but is of a special form (in
particular, its first column has only three nonzero elements), the matrix $P_1$
will also be of the following form


$$P_1 = \left[ \begin{array}{ccc|c} \times & \times & \times & \\ \times & \times & \times & O \\ \times & \times & \times & \\ \hline & O & & I \end{array} \right]$$ (10.5-43)


and $P_1 A_k P_1$ will be nearly Hessenberg. Hence, the transformation of
$P_1 A_k P_1$ to Hessenberg form will be very efficient, and a double QR step will
take only of the order of $5n^2$ multiplications, which is a very satisfactory
result. If we add to this the fact that we deflate the matrix each time we find a
single root or a pair of roots, and that convergence is quite rapid so that an
average of fewer than two steps per eigenvalue is required, we can
understand why the double QR algorithm has become the accepted method for
finding the eigenvalues of Hessenberg matrices and, therefore, in view of
Sec. 10.4, of general real nonsymmetric matrices.

Finally, we mention one apparent problem with the QR algorithm. If
there are different roots of the same magnitude, we need not converge to a
matrix of the form (10.5-16). In general, this does not cause any difficulties
since the introduction of shifts will eliminate this problem. However, it is
possible to find matrices for which the shift strategy given above will not
work. Thus, the matrix


$$\begin{bmatrix} 0 & 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \end{bmatrix}$$ (10.5-44)


which has the eigenvalues $\exp(\frac{2}{5}\pi i r)$, $r = 0, \ldots, 4$, is left unchanged by the
double QR algorithm as given above. Therefore, any program implementing
the QR algorithm should introduce a random shift if there is no convergence
after a fixed number of iterations. This will usually suffice to bring about
convergence.
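
The stagnation on the matrix (10.5-44) can be seen directly. For this matrix the bottom 2 × 2 block has both eigenvalues equal to 0, so the strategy above keeps the shifts at zero, and a double step with zero shifts is equivalent to two unshifted QR steps. The sketch below, whose function name `unshifted_qr_step` is our own, performs one unshifted step with plane rotations and shows that the matrix is reproduced exactly.

```python
import math


def unshifted_qr_step(A):
    """One QR step with zero shift on a Hessenberg matrix: A = QR, return RQ."""
    n = len(A)
    M = [[float(x) for x in row] for row in A]
    rotations = []
    for i in range(n - 1):                  # reduce to R by plane rotations
        x, y = M[i][i], M[i + 1][i]
        r = math.hypot(x, y)
        c, s = (1.0, 0.0) if r == 0.0 else (x / r, y / r)
        rotations.append((c, s))
        for j in range(n):
            u, v = M[i][j], M[i + 1][j]
            M[i][j], M[i + 1][j] = c * u + s * v, -s * u + c * v
    for i, (c, s) in enumerate(rotations):  # multiply R by Q on the right
        for j in range(n):
            u, v = M[j][i], M[j][i + 1]
            M[j][i], M[j][i + 1] = c * u + s * v, -s * u + c * v
    return M


if __name__ == "__main__":
    C = [[0, 0, 0, 0, 1],
         [1, 0, 0, 0, 0],
         [0, 1, 0, 0, 0],
         [0, 0, 1, 0, 0],
         [0, 0, 0, 0, 0]]
    C[4][3] = 1                              # complete the cyclic pattern
    print(unshifted_qr_step(C) == C)
```

The reason is that the matrix is itself orthogonal, so its triangular factor is the identity and $RQ = Q = A$; only a shift that breaks this symmetry, such as the random shift mentioned above, lets the iteration make progress.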


Example 10.10 Apply the double QR algorithm to find all the eigenvalues of the following
Hessenberg matrix $A = A_1$.


$$A_1 = \begin{bmatrix} 5.0 & -2.0 & -5.0 & -1.0 \\ 1.0 & 0.0 & -3.0 & 2.0 \\ 0 & 2.0 & 2.0 & -3.0 \\ 0 & 0 & 1.0 & -2.0 \end{bmatrix}$$


The computations below were carried out to about 14 significant figures using the
listed values of $p_1^{(k)}$ and $p_2^{(k)}$. We give the results rounded to 5 decimal places. As in Example
10.9, we give the transformation from $A_1$ to $A_3$ in detail. Subsequently, we give only the
eigenvalues $\mu_k$, $\nu_k$ of $T_k$, the pairs $(p_1^{(k)}, p_2^{(k)})$, and the matrices $A_{k+2}$, $k = 3, \ldots, 11$.

$$T_1 = \begin{bmatrix} 2.0 & -3.0 \\ 1.0 & -2.0 \end{bmatrix}$$

so that $\mu_1 = 1.0$, $\nu_1 = -1.0$. The initial values for $p_1^{(1)}$ and $p_2^{(1)}$ are $p_1^{(1)} = p_2^{(1)} = 0$.
Using (10.5-37), we compute, for these values of $p_1$ and $p_2$,

$b_{11} = a_{11}^2 + a_{12} a_{21} = 23.0$
$b_{21} = a_{21}(a_{11} + a_{22}) = 5.0$
$b_{31} = a_{32} a_{21} = 2.0$

From (10.5-39) we have

$u = [46.62202 \quad 5.0 \quad 2.0 \quad 0]^T$

and from (10.5-38)


$$P_1 = \begin{bmatrix} -.97368 & -.21167 & -.08467 & 0 \\ -.21167 & .97730 & -.00908 & 0 \\ -.08467 & -.00908 & .99637 & 0 \\ 0 & 0 & 0 & 1.0 \end{bmatrix}$$

so that

$$P_1 A_1 P_1 = \begin{bmatrix} 4.11828 & 2.76448 & 5.72860 & .80433 \\ .15331 & .43031 & -1.88167 & 2.19351 \\ -.24348 & 2.18233 & 2.45142 & -2.92260 \\ -.08467 & -.00908 & .99637 & -2.0 \end{bmatrix}$$


In order to reduce $P_1 A_1 P_1$ to Hessenberg form, we generate Householder matrices
as in Sec. 10.3-3. We first compute


$$P_2 = \begin{bmatrix} 1.0 & 0 & 0 & 0 \\ 0 & -.51115 & .81181 & .28230 \\ 0 & .81181 & .56389 & -.15165 \\ 0 & .28230 & -.15165 & .94726 \end{bmatrix}$$

and then

$$P_2 P_1 A_1 P_1 P_2 = \begin{bmatrix} 4.11828 & 3.46451 & 5.35253 & .67356 \\ -.29992 & .68720 & 3.69620 & -3.89738 \\ 0 & -.92577 & 1.05049 & .90434 \\ 0 & -.09024 & -.02781 & -.85597 \end{bmatrix}$$

Next, we compute

$$P_3 = \begin{bmatrix} 1.0 & 0 & 0 & 0 \\ 0 & 1.0 & 0 & 0 \\ 0 & 0 & -.99528 & -.09702 \\ 0 & 0 & -.09702 & .99528 \end{bmatrix}$$

and finally

$$P_3 P_2 P_1 A_1 P_1 P_2 P_3 = A_3 = \begin{bmatrix} 4.11828 & 3.46451 & -5.39263 & .15108 \\ -.29992 & .68721 & -3.30064 & -4.23760 \\ 0 & .93016 & 1.11718 & -.71200 \\ 0 & 0 & .22015 & -.92266 \end{bmatrix}$$

$\mu_3 = 1.03720 \qquad \nu_3 = -.84268 \qquad p_1^{(3)} = \mu_3, \quad p_2^{(3)} = \nu_3$

$$A_5 = \begin{bmatrix} 3.88102 & 5.50657 & -2.83249 & -.98166 \\ .08301 & 2.29698 & -3.41894 & 3.17136 \\ 0 & 1.65698 & -.18116 & 2.41387 \\ 0 & 0 & .01060 & -.99684 \end{bmatrix}$$

$\mu_5 = -.15091 \qquad \nu_5 = -1.02709 \qquad p_1^{(5)} = p_2^{(5)} = \nu_5$

$$A_7 = \begin{bmatrix} 4.01900 & .73616 & -6.17208 & .85558 \\ -.02762 & -.19762 & -3.26500 & -3.83089 \\ 0 & 1.68124 & 2.17861 & 1.23705 \\ 0 & 0 & -\epsilon & -1.00000 \end{bmatrix}$$

$\mu_7 = 2.17861 \qquad \nu_7 = -1.00000 \qquad p_1^{(7)} = p_2^{(7)} = \nu_7$

$$A_9 = \begin{bmatrix} 4.01231 & 5.45684 & -3.02039 & -.83340 \\ -.00903 & 2.11330 & -3.33301 & 3.23455 \\ 0 & 1.57244 & -.12561 & 2.40441 \\ 0 & 0 & 0 & -1.00000 \end{bmatrix}$$

$$A_{11} = \left[ \begin{array}{ccc|c} 3.99618 & 4.15400 & -4.64849 & * \\ .00182 & .37161 & 1.17200 & * \\ 0 & -3.74868 & 1.63220 & * \end{array} \right]$$

$$A_{13} = \left[ \begin{array}{ccc|c} 4.00000 & 5.04835 & -3.65643 & * \\ 0 & 1.87894 & -3.59100 & * \\ 0 & 1.32902 & .12106 & * \end{array} \right]$$


$A_{13}$ is in block triangular form, and we compute from it the eigenvalues $\lambda_i$ of $A$:

$\lambda_1 = 4.0 \qquad \lambda_2 = 1.0 + 2.0i \qquad \lambda_3 = 1.0 - 2.0i \qquad \lambda_4 = -1.0$


Note that since we deflated $A_9$, we chose $p_1^{(9)} = p_2^{(9)} = 0$.


10.6 ERRORS IN COMPUTED EIGENVALUES AND 
EIGENVECTORS 


In most of the methods considered above, we do not compute the
eigenvalues directly from the given matrix $A$ but from a similar matrix $B$, to
which we have transformed $A$. In the course of the computation of $B$ from $A$,
roundoff errors enter, so that $B$ is not strictly similar to $A$. However, by
backward error analysis, it can be shown that for all the methods discussed
in this chapter $B$ is similar to $A + \delta A$, where the elements of $\delta A$ are small.
Thus, in the reduction of $A$ to Hessenberg or tridiagonal form by
Householder transformations using $t$-digit floating-point binary arithmetic, it can
be shown that


$\|\delta A\|_E \le \gamma n^2 2^{-t} \|A\|_E$ (10.6-1)


where $\gamma$ is a constant of order 1. If inner products are accumulated to double
precision, the bound (10.6-1) becomes


$\|\delta A\|_E \le \gamma n 2^{-t} \|A\|_E$ (10.6-2)


The important question to consider is the effect of perturbations of the
elements of $A$ on its eigenvalues (cf. Sec. 8.13).

Let us first consider the symmetric case, where both $A$ and $B$, and hence
also $\delta A$, are symmetric. If the eigenvalues of these matrices are $\alpha_i$, $\beta_i$, and $\delta_i$,
respectively, arranged in nonincreasing order, it can be shown that


$\alpha_i + \delta_n \le \beta_i \le \alpha_i + \delta_1$ (10.6-3)

If the elements of $\delta A$ are all less than $\epsilon$ in magnitude, we have further that

$-n\epsilon \le \delta_n \le \delta_1 \le n\epsilon$ (10.6-4)

so that $-n\epsilon \le \beta_i - \alpha_i \le n\epsilon$ (10.6-5)


These results hold even when there are multiple eigenvalues, and so it
follows that the eigenvalue problem for a symmetric matrix is always
well-conditioned.

In the nonsymmetric case, we may have ill conditioning. Let $\alpha_i$ be a
simple eigenvalue of $A$ and $x_i$ and $y_i$ the corresponding right and left
eigenvectors normalized so that $\|x_i\|_2 = \|y_i\|_2 = 1$. Then as $\delta A$ tends to the null
matrix, $A + \delta A$ has an eigenvalue $\alpha_i + \delta\alpha_i$ such that


$\delta\alpha_i \sim \dfrac{y_i^T \, \delta A \, x_i}{y_i^T x_i}$ (10.6-6)

Thus, for $\delta A$ small enough,

$|\delta\alpha_i| \le \dfrac{\|\delta A\|_2}{|y_i^T x_i|}$ (10.6-7)


where $y_i^T x_i$ is the cosine of the angle $\theta_i$ between $y_i$ and $x_i$. Matrices exist with
simple eigenvalues for which the $\cos \theta_i$ are arbitrarily small, and any such
eigenvalue is very sensitive to perturbations in the elements. Still, if
$\|\delta A\|_2 \le \epsilon$, we have that $|\delta\alpha_i| \le \epsilon / |\cos \theta_i|$ and the right-hand side is linear
in $\epsilon$. If $\alpha_i$ is not a simple root, the situation may be worse. For example, the
matrix


$\begin{bmatrix} a & 1 \\ 0 & a \end{bmatrix}$ (10.6-8)

has the double eigenvalue $\alpha_1 = \alpha_2 = a$, while the perturbed matrix

$\begin{bmatrix} a & 1 \\ \epsilon & a \end{bmatrix}$ (10.6-9)


has the eigenvalues $a \pm \epsilon^{1/2}$.
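
The square-root sensitivity in this example is easy to reproduce numerically. The sketch below uses the closed-form roots of the characteristic polynomial of a 2 × 2 matrix; the function name `eig2x2` and the particular values of $a$ and $\epsilon$ are our own choices.

```python
import cmath


def eig2x2(M):
    """Both eigenvalues of a 2 x 2 matrix from its characteristic polynomial."""
    (a, b), (c, d) = M
    mean = 0.5 * (a + d)
    root = cmath.sqrt((0.5 * (a - d)) ** 2 + b * c)
    return mean + root, mean - root


if __name__ == "__main__":
    a, eps = 1.0, 1e-10
    print(eig2x2([[a, 1.0], [0.0, a]]))    # the double eigenvalue a
    print(eig2x2([[a, 1.0], [eps, a]]))    # moved by about sqrt(eps) = 1e-5
```

A perturbation of size $10^{-10}$ in one element thus moves the eigenvalues by about $10^{-5}$, five orders of magnitude more than the linear estimate available for a simple eigenvalue.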


We see that for a particular matrix A, some eigenvalues may be sensitive 
while others are not. For simple eigenvalues, it depends on the angle be- 
tween the corresponding right and left eigenvectors. In case cos 6; is not 
small, «; is well-determined and a good algorithm should give accurate 
results. 

The sensitivity of the eigenvectors of a matrix with respect to perturbations 
in its elements is a more complicated problem. If A has distinct 
eigenvalues, the eigenvector $x_i'$ of $A + \delta A$ is such that as $\|\delta A\| \to 0$ 

\[ x_i' - x_i \sim \sum_{k \ne i} \frac{(y_k^T \, \delta A \, x_i)\, x_k}{(\alpha_i - \alpha_k) \cos\theta_k} \tag{10.6-10} \]

In the symmetric case, all the $\cos\theta_k = 1$, so that the sensitivity of the 
eigenvector $x_i$ is dependent only on the closeness of $\alpha_i$ to other eigenvalues. 
In the nonsymmetric case, the factors $\cos\theta_k$ make things more complicated. 
However, if none of the $\cos\theta_i$ is small, the behavior is much the same as for 
symmetric matrices. 
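
The dependence of eigenvector sensitivity on the separation of the eigenvalues can be sketched for an assumed symmetric example: the matrix with rows (1, $\epsilon$) and ($\epsilon$, 1 + g), a perturbation of diag(1, 1 + g) whose eigenvectors are the unit coordinate vectors. The Python fragment below (hypothetical function names) computes the component t of the perturbed eigenvector (1, t) belonging to the eigenvalue nearest 1 and compares it with the first-order term $\epsilon/(\alpha_1 - \alpha_2) = -\epsilon/g$ suggested by (10.6-10) with all cosines equal to 1.

```python
import math

def tilt(gap, eps):
    """Component t such that (1, t) is an eigenvector of [[1, eps], [eps, 1 + gap]]
    for the eigenvalue nearest 1 (written to avoid cancellation)."""
    half = 0.5 * gap
    return -eps / (half + math.hypot(half, eps))

def first_order_tilt(gap, eps):
    """First-order coefficient of x_2: eps / (alpha_1 - alpha_2) with alpha_1 - alpha_2 = -gap."""
    return eps / (-gap)

eps = 1e-6
for gap in (1.0, 1e-3):
    print(gap, tilt(gap, eps), first_order_tilt(gap, eps))
```

The same perturbation tilts the eigenvector about a thousand times more when the gap between the eigenvalues is $10^{-3}$ than when it is 1.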


BIBLIOGRAPHIC NOTES 


The literature on the calculation of eigenvalues and eigenvectors is almost as vast as that on 
simultaneous linear equations. The best general source for the material of this chapter is the 
book by Wilkinson (1965). Some good classical sources are Bodewig (1959), Faddeeva (1959), and 
Householder (1953). More recent sources are Faddeev and Faddeeva (1963), Fox (1965), Stewart 
(1973), and Gourlay and Watson (1973). A compendium of fully documented Algol programs 
which implement almost all the methods discussed in this chapter, in addition to some not 
treated here, is given in Wilkinson and Reinsch (1971). The corresponding Fortran programs 
appear in Smith et al. (1974). A good summary comparing various methods is given by Wilkin- 
son (1966). An extensive survey of the theoretical aspects of the calculation of eigenvalues and 
eigenvectors as well as an excellent bibliography is contained in Householder (1964). 


Section 10.1 The material in this section is all classical. Good sources are Bodewig (1959) 
and Householder (1953), both of which contain extensive bibliographies. 


Section 10.2 Our major source for the material of this section is Bodewig (1959). Much of 
the material can also be found in Faddeeva (1959). Wilkinson’s technique can be found in 
Wilkinson (1955). The Rayleigh quotient appears in a number of areas of numerical analysis; 
see, for example, Kopal (1961). The generalized Rayleigh iteration is studied in great detail in a 
series of papers by Ostrowski (1958-1959). Computational aspects of inverse iteration are 
discussed in Ortega (1967). 


Section 10.3 The various methods for computing eigensystems of symmetric matrices are 
discussed at length by Schwarz et al. (1973). There is a large literature on the Jacobi method. 
For a discussion emphasizing computational aspects, see Greenstadt (1960). The reformulation 
(10.3-19) of the Jacobi equations (10.3-11) to (10.3-13) is given in the contribution by Rutishauser 
in Wilkinson and Reinsch (1971). Pope and Tompkins (1957) discuss the threshold Jacobi method 
{24}. An error analysis is given by Goldstine, Murray, and Von Neumann (1959). Wilkinson 
(1962) analyzes the errors in the Jacobi method as well as those in other methods based on 
orthogonal transformations. For original papers on the other methods of this section, see 
Givens (1954) and Householder and Bauer (1959). The use of the method of bisection is 
considered by Wilkinson (1959a) and Ortega (1967). The analysis of the recurrence (10.3-40) is 
due to Kahan (1966). Householder’s method, especially in its computational aspects, is carefully 
explained by Wilkinson (1960). Wilkinson (1958b) illustrates the problems involved in comput- 
ing the eigenvectors of tridiagonal matrices. 


Section 10.4 The biorthogonalization technique is due to Lanczos (1950). Wilkinson 
(1958a) discusses some computational aspects of the method. The use of Gaussian elimination 
to reduce a matrix to Hessenberg form is due to Wilkinson (1959b). The technique to find the 
eigenvalues of such a matrix is due to Hyman (1957). Wilkinson (1959a) discusses some compu- 
tational aspects of Hyman’s and other methods. See Greenstadt (1955) for a discussion of the 
Jacobi method for nonsymmetric matrices. White (1958) considers all these methods and con- 
tains an extensive bibliography. Goldstine and Horwitz (1959) have generalized Jacobi’s 
method to normal matrices. The norm-reducing method appears in Eberlein (1962). Algorithms 
implementing this method are given in the contributions by Eberlein and Boothroyd and by 
Eberlein in Wilkinson and Reinsch (1971). 


Section 10.5 The LR transformation method is due to Rutishauser (1955). The best paper 
on this method in English is by Rutishauser (1958). The QR transformation is due to Francis 
(1961). Parlett (1964a) gives an excellent exposition of both transformations. The computer 
implementation of the double QR algorithm is discussed in Parlett (1967), which is the source of 
Theorem 10.15 and Example 10.8. Stewart (1973) contains a good treatment of both the simple 
and double QR algorithms, including explicit algorithms for both methods. For the application 
of the QR algorithm to the computation of eigenvectors, see Stewart (1973) and the contribu- 
tion of Peters and Wilkinson in Wilkinson and Reinsch (1971). 


Section 10.6 Chapter 2 of Wilkinson (1965) discusses in detail the perturbation theory of 
eigenvalues and eigenvectors for both symmetric and nonsymmetric matrices. The bounds 
(10.6-1) and (10.6-2) are from Stewart (1973). For an extensive and excellent discussion of errors 
in eigenvalue computations see Wilkinson (1964). 

PROBLEMS 


Section 10.1 


1 (a) Let $x_i$ and $y_j$ be, respectively, a right and left eigenvector of a matrix A. Prove that $x_i$ 
and $y_j$ are orthogonal if they correspond to distinct eigenvalues. 

(b) Therefore, prove that the (right) eigenvectors of a symmetric matrix corresponding to 
distinct eigenvalues are orthogonal. 


2 (a) Prove Theorems 10.1 and 10.2. 
(b) Use Theorem 10.1 to prove Theorem 10.4 in the case where A is symmetric. 


(c) Prove Theorem 10.4 when A has a number of independent eigenvectors equal to its 
order. [Ref.: Birkhoff and MacLane (1953), pp. 306-307, and Bodewig (1959), pp. 59-60.] 


3 Derive an algorithm for computing the coefficients of a polynomial of degree n given the 
sums of the first n powers of its zeros. 


*4 Danilevsky’s method. Let $A = [a_{ij}]$ be a matrix of order n. 

(a) Suppose $a_{n,n-1} \ne 0$. Let $M_{n-1}$ be the identity matrix of order n with its (n − 1)st row 
replaced by $-a_{n1}/a_{n,n-1}$, $-a_{n2}/a_{n,n-1}$, ..., $1/a_{n,n-1}$, $-a_{nn}/a_{n,n-1}$. Show that $M_{n-1}^{-1}$ is the 
identity with its (n − 1)st row replaced by the nth row of A. 

(b) Show that $M_{n-1}^{-1} A M_{n-1}$ has zeros in the nth row except in the (n − 1)st column, 
where there is a 1. 

(c) Show that, by applying this technique n — 1 times, if the element to the left of the 
diagonal term is not 0 at any stage we can reduce A to a similar matrix B of the form 


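\[
B = \begin{bmatrix}
p_1 & p_2 & \cdots & p_{n-1} & p_n \\
1 & 0 & \cdots & 0 & 0 \\
0 & 1 & \cdots & 0 & 0 \\
\vdots & & \ddots & & \vdots \\
0 & 0 & \cdots & 1 & 0
\end{bmatrix}
\]

(d) Show that the characteristic equation of B is 

\[ P(\lambda) = (-1)^n (\lambda^n - p_1 \lambda^{n-1} - p_2 \lambda^{n-2} - \cdots - p_n) = 0 \]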


(e) Show that the number of multiplications and divisions required to calculate the 
characteristic equation is (n — 1)(n? + n). 

(f) Suppose the process has proceeded to the stage where the k + 1 to n rows have been 
reduced to the desired form. Suppose also that the element in the (k, k − 1) position is 0 but that 
for some j < k − 1 the term in the (k, j) position is not 0. How can the process be continued? 

(g) Suppose now that the elements in the (k, j) positions for all j < k are all 0. What can 
be done in this case? [Ref.: Faddeeva (1959), pp. 166-176.] 

5 Let A be the matrix (cf. Example 10.7) 


\[ A = \begin{bmatrix} 2 & -2 & 3 \\ 1 & 1 & 1 \\ 1 & 3 & -1 \end{bmatrix} \]


(a) Use Gerschgorin’s theorem to find a domain in which the eigenvalues must lie. 

(b) Repeat part (a) using $A^T$ in place of A. 

(c) Let S be a diagonal matrix of order 3 with elements 1, 2, 2 on the diagonal. Calculate 
$B = SAS^{-1}$ and apply Gerschgorin’s theorem to $B^T$. 

(d) Generalize part (c) by indicating what happens to A when a diagonal matrix S with a 
in its last m positions and ones in the remaining positions is used to effect a similarity transforma- 
tion on A. 

(e) With m = 2, what value of a minimizes the total length of the domain on the real axis 
in which Gerschgorin’s theorem says the eigenvalues of A must lie? 

6 Prove that two similar matrices have the same trace. 

7 (a) Let A and B be two matrices such that AB is defined. Let the ranks of A and B be 
r(A) and r(B). Prove that $r(AB) \le \min[r(A), r(B)]$. 

(b) Thus deduce that if A is nonsingular, then r(AB) = r(B). 

(c) Finally deduce that two similar matrices have the same rank. 

*8 Generalized eigenvectors. 

(a) Show that the number of eigenvectors that a matrix A has corresponding to an 
eigenvalue $\lambda_i$ is equal to the number of different elementary divisors (10.1-42) in which $\lambda_i$ 
appears. 

(b) Suppose A has an elementary divisor of order $\nu$ corresponding to $\lambda_i$. Show that there 
exists a solution $x_\nu$ of the system 

\[ (A - \lambda_i I)^{\nu} x = 0 \]

such that 

\[ (A - \lambda_i I)^{\nu - 1} x_\nu \ne 0 \]

(c) If we define 

\[ (A - \lambda_i I)^{j} x_\nu = x_{\nu - j} \qquad j = 1, \ldots, \nu - 1 \]

show that 

\[ (A - \lambda_i I)^{j} x_j = 0 \qquad j = 1, \ldots, \nu - 1 \]

(The vectors $x_j$, j = 1, ..., $\nu$, are called generalized eigenvectors of rank j corresponding to $\lambda_i$. 
The vector $x_1$ is an eigenvector.) 

(d) Prove that the vectors $x_j$, j = 1, ..., $\nu$, are linearly independent. 


Section 10.2 


9 (a) Where does the derivation of the power method fail if A has nonlinear elementary 
divisors ? 
(b) Show that the matrix 


\[ A = \begin{bmatrix} 6 & 2 & 2 \\ -2 & 2 & 0 \\ 0 & 0 & 2 \end{bmatrix} \]


has nonlinear elementary divisors. 
(c) Do 10 iterations of the power method on this matrix using (1, 1, 1) as the starting 
vector. Does the method seem to be converging? 

10 (a) Suppose the dominant eigenvalue $\lambda_1$ of A is multiple and that the power method 
has been used to compute $\lambda_1$ and a corresponding eigenvector $x_1$. In order to compute another 
eigenvector corresponding to $\lambda_1$, is it sufficient to choose an initial vector $v_0$ which is orthogonal 
to $x_1$? Why? Is it necessary that $v_0$ be orthogonal to $x_1$? Why? 

(b) From this deduce a technique to calculate all the eigenvectors corresponding to a 
given eigenvalue. 

(c) If it is not known a priori what the multiplicity of the eigenvalue is, what will happen 
when all the eigenvectors for a given eigenvalue have been found? 

(d) Apply this technique to find all the eigenvectors corresponding to the dominant 
eigenvalue of 

7 23 —13 
A= 3 53 13 
1b Skah 

11 The matrix B in part (c) of Prob. 4 is called the companion matrix of the polynomial 
P(x) in part (d) of that problem. 

(a) Show why the power method applied to the companion matrix is precisely equivalent 
to Bernoulli’s method applied to P(x). 

(b) What operation applied to the companion matrix corresponds to Graeffe’s method? 


12 (a) Let f(x) be a function of x with a Maclaurin expansion F(x). If A is a square 
matrix, define f(A) = F(A). What are the eigenvalues of f(A)? (Cf. Theorem 10.1.) 
(b) Consider the iteration 

\[ v_{m+1} = f(A) v_m \qquad m = 1, 2, \ldots \]

where $v_1$ is arbitrary. Generalize Theorem 10.12 for this iteration. 

(c) If y is a good approximation to an eigenvalue A, of A, why should f(x) = 1/(x — y) be 
a good function to use in part (b)? But what happens if this function is actually used? 

(d) But deduce from part (c) that for computational purposes using 

\[ f(A) = A^k + yA^{k-1} + \cdots + y^{k-1}A + y^k I \qquad k \ge 1 \]


should produce more rapid convergence than using A if A, is the dominant eigenvalue. [Ref.: 
Bodewig (1959), pp. 322-323.] 


13 (a) Use the technique of the previous problem with k =2 and y =2 to find the 
dominant eigenvalue of the matrix of Example 10.1. 

(b) Show the connection between the technique of the previous problem and Wilkinson’s 
method. 

14 (a) What does the iteration of part (b) of Prob. 12 converge to if $f(x) = e^{x}$? If $f(x) = e^{-x}$? 

(b) Show how $f(x) = e^{-x}$ might be used to determine if a matrix is positive definite. 

15 (a) Derive (10.2-13) from (10.2-12). 

(b) Apply the $\delta^2$ process to the calculation of Example 10.2 with m = 5. Using the results 
of Example 10.4, explain whether or not this result is just fortuitous. 

16 (a) If the eigenvalues of A are all positive, show that the optimal value of p in Wilkinson’s 
method is $(\lambda_2 + \lambda_n)/2$. 

(b) What is the optimal choice for p in Example 10.1 given the results of Example 10.4? 

17 Explain why you would expect the Rayleigh quotient to generally give better results 
than the $\delta^2$ process. 


18 Use the power method to calculate the dominant eigenvalue and corresponding eigen- 
vector of 


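\[
\text{(a)}\quad A_1 = \begin{bmatrix} 7 & 3 & -2 \\ 3 & 4 & -1 \\ -2 & -1 & 3 \end{bmatrix}
\qquad
\text{(b)}\quad A_2 = \begin{bmatrix} 3 & -4 & 3 \\ -4 & 6 & 3 \\ 3 & 3 & 1 \end{bmatrix}
\]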


Stop each iteration when three decimal places of the eigenvalue have stabilized. 

19 (a) Use the $\delta^2$ process, where applicable, to get an improved value of the eigenvalue 
for each part of the previous problem. 

(b) Use Wilkinson’s method with p = 2 to speed up the convergence of the power method 
for the matrix A, of the previous problem. 

(c) Use the Rayleigh quotient to get an improved value of the eigenvalue for $A_1$ and $A_2$ in 
the previous problem. 


20 Show that (10.2-19) holds for the inverse power method given the assumptions 
preceding this equation. 


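21 (a) Let $v_i = x_j + \sum_{k=1}^{n}{}' \epsilon_{ik} x_k$, where the eigenvectors of the symmetric matrix A are $x_1$, 
..., $x_n$ normalized so that $\|x_k\|_2 = 1$, k = 1, ..., n, and where $\sum'$ denotes that the index k = j is 
omitted in the summation. Show that $v_{i+1}$ given by (10.2-22) and (10.2-23) satisfies, apart from a 
normalizing factor, the relation 

\[ v_{i+1} = x_j + \frac{\sum' (\lambda_j - \lambda_k)\, \epsilon_{ik}^2}{1 + \sum' \epsilon_{ik}^2} \; \sum{}' \frac{\epsilon_{ik}}{\lambda_k - \mu_{i+1}}\, x_k \]

so that all coefficients other than that of $x_j$ are cubic in the $\epsilon_{ik}$. 

(b) Let A be a general matrix with n distinct eigenvalues $\lambda_1$, ..., $\lambda_n$ and corresponding 
right and left eigenvectors $x_1$, ..., $x_n$ and $y_1$, ..., $y_n$, respectively. Consider the following iteration 
starting with $v_1$ and $w_1$: 

\[ \mu_{i+1} = \frac{w_i^T A v_i}{w_i^T v_i} \qquad v_{i+1} = k_{i+1} (A - \mu_{i+1} I)^{-1} v_i \qquad w_{i+1} = k'_{i+1} (A^T - \mu_{i+1} I)^{-1} w_i \]

where $k_{i+1}$ and $k'_{i+1}$ are normalizing factors chosen so that $\|v_{i+1}\| = \|w_{i+1}\| = 1$. Show that if 
$v_i = x_j + \sum' \epsilon_{ik} x_k$ and $w_i = y_j + \sum' \eta_{ik} y_k$, then 

\[ \mu_{i+1} = \frac{\lambda_j\, y_j^T x_j + \sum' \lambda_k \epsilon_{ik} \eta_{ik}\, y_k^T x_k}{y_j^T x_j + \sum' \epsilon_{ik} \eta_{ik}\, y_k^T x_k} \]

(c) Show that, apart from normalizing factors 

\[ v_{i+1} = x_j + (\lambda_j - \mu_{i+1}) \sum{}' \frac{\epsilon_{ik}\, x_k}{\lambda_k - \mu_{i+1}} \qquad w_{i+1} = y_j + (\lambda_j - \mu_{i+1}) \sum{}' \frac{\eta_{ik}\, y_k}{\lambda_k - \mu_{i+1}} \]

so that all coefficients other than those of $x_j$ and $y_j$ are cubic in the $\epsilon_{ik}$ and $\eta_{ik}$. 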


Section 10.3 


22 Show that (10.3-19) is equivalent to (10.3-11) through (10.3-13). 


23 Give a formal proof of the convergence of the Jacobi method to a diagonal matrix 
similar to A in the case where the off-diagonal element of greatest magnitude is annihilated at 
each stage. Why isn’t this proof sufficient to show that the method converges for any choice of a 
nonzero off-diagonal element at each stage? 


*24 The threshold Jacobi method. 

(a) With $\nu_0$ given by (10.3-4) define $\gamma_1 = \sqrt{\nu_0}/\sigma$, where $\sigma$ is a positive number called the 
threshold constant. If $\sigma > n$, show that there is at least one off-diagonal element in the original 
matrix A greater than or equal to $\gamma_1$ in magnitude. 

(b) If all off-diagonal elements of magnitude greater than or equal to $\gamma_1$ are annihilated, 
show that the remaining sum of squares of the off-diagonal elements is no greater than 
$(1 - 2/\sigma^2)\nu_0$. 

(c) Define y,,, =y,/o, i= 1, 2, .... Then at the ith stage of the method, all elements of 
magnitude greater than or equal to y, will be annihilated. Let {y, 3 be the subsequence of {y;} 
such that at the i; stage at least one element is annihilated. Deduce that, after i, stages, the sum 
of squares of the off-diagonal elements is no greater than (1 — 2/c7)"v,. 

(d) Suppose we set an accuracy requirement that the final sum of squares of the off- 
diagonal elements should be less than p*v, where p is some constant. Show that this require- 
ment will be satisfied if the final threshold y, is such that yp < (p/n)./vo - [Ref.: Greenstadt 
(1960).] 

25 Do three stages of the threshold Jacobi method, in which at least one element is 
annihilated, for the matrix of Example 10.1 using o = 3. 


26 (a) Show that the sequence of orthogonal transformations defined by (10.3-32) annihilates 
the elements $a_{24}, a_{25}, \ldots, a_{2n}$ and does not affect the zeros in the first row obtained using 
(10.3-30). 

(b) Thus deduce that the sequence of transformations (10.3-33) does indeed reduce A to 
the form B in (10.3-34). 

27 (a) Derive (10.3-36). 

(b) Show that if any $c_i$ in (10.3-34) is 0, then the determinant of B can be written as the 
product of the determinants of two smaller tridiagonal matrices. 

(c) Prove that the sequence defined by (10.3-36) is a Sturm sequence if no $c_i = 0$. 

(d) Verify (10.3-38) when i = 2. 

28 Derive the relationship between the eigenvectors of B in (10.3-34) and those of A. 


*29 (a) Show that the matrix P defined by (10.3-42) is symmetric and orthogonal. 
(b) Use (10.3-45) to (10.3-47) to show that $A_k$ has zeros in the same positions in its first 
k − 2 rows and columns as $A_{k-1}$. 
(c) Show that with the elements of v chosen as in (10.3-49) and (10.3-50), $A_k$ has zeros in the 
desired positions in the (k − 1)st row and column and that (10.3-41) is satisfied. 


30 Let y be an eigenvector of A,_, given by (10.3-45). 
(a) Show that the corresponding eigenvector x of A is given by 


X= P,P,°°° P,-.y 


How would you calculate x given y? 

(b) Compare this result with that of Prob. 28 to show that about half as much computa- 
tion is required to compute the eigenvectors of A in Householder’s method as in Givens’ 
method. Assume that both v, and 2v, are available from the computation of the eigenvalues (cf. 
Example 10.6). 

(c) Derive the equation for A, given in Example 10.6. 


*31 (a) Verify that the total number of operations required in Givens’ method to reduce 
the matrix to tridiagonal form is of the order of $\frac{4}{3}n^3$ and that (n − 2)(n − 1)/2 square roots must 
be calculated. 

(b) Verify that the corresponding figures for Householder’s method are $\frac{2}{3}n^3$ operations 
and 2n − 4 square roots if the scheme of Example 10.6 is used. 

(c) Show, however, that the $v^{(k)}$ need not be calculated explicitly and that therefore only 
n − 2 square roots are required. [Ref.: Wilkinson (1960).] 


32 Carry through the computation of Example 10.6 using the minus instead of the plus 
sign in (10.3-49). Keep five decimals in all computations. Compare the results with those of 
Example 10.6 and explain the differences. 


33 (a) Use the Jacobi method as described in Sec. 10.3-1 to find the eigenvalues of A, and 
A, of Prob. 18. Carry through five rotations. Why are the results for A, better than those for 
A,? 

(b) Use Givens’ method to tridiagonalize $A_1$ and $A_2$ and then calculate the characteristic 
equation. 

(c) Use Householder’s method to tridiagonalize $A_1$ and $A_2$ and then calculate the characteristic 
equation. 

34 (a) Show that if $P_k$ and $A_k$ are defined by (10.3-44) and (10.3-45), respectively, where 
now $A_{k-1}$ is not necessarily symmetric, $A_k$ can be computed by the following series of 
operations: 

\[ p_k^T = v_k^T A_{k-1} \qquad B_k = A_{k-1} - 2 v_k p_k^T \]
\[ q_k = B_k v_k \qquad A_k = B_k - 2 q_k v_k^T \]

(b) Determine the number of operations needed to reduce a general matrix A to upper 


Hessenberg form using this formulation with the improvement given in Prob. 31c. [Ref.: Wil- 
kinson (1966), p. 42.] 


Section 10.4 


35 (a) Use Lanczos’ method to tridiagonalize the matrix of Example 10.1. Then calculate 
the characteristic equation and find its roots. 

(b) How many operations are required in the application of Lanczos’ method to a symmet- 
ric matrix of order n to reduce the matrix to tridiagonal form? 

(c) Apply Lanczos’ method to the matrix 


\[ A = \begin{bmatrix} 5 & 1 & -1 \\ -5 & 0 & 1 \\ 1 & 0 & 1 \end{bmatrix} \]


using $x_1^T = (.6, -1.4, .3)$ and $y_1^T = (.6, .3, -.1)$. 

(d) Repeat the calculations of part (c) using $y_1^T = x_1^T = (.6, -1.4, .3)$. Then calculate the 
characteristic equation. [Ref.: Wilkinson (1958a).] 

36 (a) Calculate the eigenvectors of the matrix T in Example 10.7. 

(b) Use the results of part (a) to calculate the eigenvectors of A. 


*37 Let A be an arbitrary matrix and let $x_1$ and $y_1$ be arbitrary vectors. Define two 
sequences of vectors 

\[ x_{i+1} = A x_i - \sum_{j=1}^{i} c_{ij} x_j \qquad y_{i+1} = A^T y_i - \sum_{j=1}^{i} c_{ij} y_j \qquad i = 1, 2, \ldots \]

(a) Show that 

\[ x_{i+1} = P_i(A) x_1 \quad \text{and} \quad y_{i+1} = P_i(A^T) y_1 \]

where $P_i(A)$ is a polynomial of degree i in A. 

(b) Thus deduce that $y_{i+1}^T x_j = y_j^T x_{i+1}$. 

(c) Let the $c_{ij}$ be chosen so that $y_{i+1}^T x_{i+1}$ is a minimum. Show that this requires that 
$0 = -y_j^T x_{i+1} - y_{i+1}^T x_j$, j = 1, ..., i. 

(d) From parts (b) and (c) deduce that this minimum requirement means that the vectors 
$x_i$ and $y_i$ must form a biorthogonal sequence. 

(e) Deduce then that those $c_{ij}$ which give the minimum are given by $c_{ij} = y_j^T A x_i / y_j^T x_j$. 

(f) Then deduce that $c_{ij} = 0$, j < i − 1. 

(g) Finally deduce that these sequences of biorthogonal vectors are precisely those of 
Lanczos’ method. Because of the requirement of part (c), Lanczos’ method is therefore often 
called the method of minimized iterations. [Ref.: Lanczos (1950).] 


38 (a) Display the elementary column matrices used in the reduction of a matrix A to 
Hessenberg form by Gaussian elimination with partial pivoting. 

(b) Display the inverses of the matrices of part (a). 

(c) Verify that the number of multiplications and divisions required for this method is 
$\frac{5}{6}n^3 + O(n^2)$ as opposed to $\frac{5}{3}n^3 + O(n^2)$ for Householder’s method. 

(d) Use this Gaussian elimination technique to derive the matrix B in Example 10.8. [Ref.: 
Wilkinson (1959b).] 

39 Suppose an element $b_{i,i+1}$ above the principal diagonal in Fig. 10.2 is zero. 

(a) Show that B can be written 

\[ B = \begin{bmatrix} B_1 & 0 \\ B_3 & B_2 \end{bmatrix} \]

where $B_1$ and $B_2$ are both in Hessenberg form. 

(b) Show how the eigenvectors of B can be found from those of B, and B,. 

40 By taking the logarithmic derivation of F ,(A), that is, computing d[log F ,(A)]/d4, show 
that (10.4-16), and consequently (10.4-17), hold. 

41 (a) If all the eigenvalues of a matrix are real, does it follow from Corollary 10.2 of 
Sec. 10.1 that the matrix can be triangularized using a sequence of plane rotation matrices of the 
type (10.3-9)? Why? If the eigenvalues are real, is it always possible to annihilate any given 
off-diagonal element by an orthogonal transformation using plane rotation matrices? 

(b) What modifications must be made in (10.3-11) to (10.3-15) for nonsymmetric matrices? 

(c) Use plane rotation matrices to annihilate successively the largest element in the upper 
triangle of the matrix of Example 10.7. Do six iterations. [Ref.: Greenstadt (1955).] 


42 (a) Show that if A is normal, then AA* = A*A. 
(b) Show that if AA* = A*A, then the triangular matrix T in Theorem 10.10 is diagonal, 
so that A is normal. 


Section 10.5 


43 (a) Let $A = [a_{ij}]$ be a band matrix; that is, $a_{ij} = 0$ for |i − j| > m for some m < n − 1. 
Prove that the LR transformation applied to A results in a sequence of matrices $A_k$ all of which 
are band matrices with the same value of m. 

(b) Prove that the QR transformation applied to a symmetric band matrix results in a 
sequence of symmetric band matrices with the same value of m. 

(c) Show that if A is in upper Hessenberg form, the QR transformation results in a 
sequence of matrices $A_k$ each of which is in upper Hessenberg form. [Ref.: Rutishauser (1958), 
p. 71, and Francis (1961).] 

44 Show that if $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_n)$ and L is a unit lower triangular matrix with elements 
$l_{ij}$, i > j, then $\Lambda^k L \Lambda^{-k} = I + B_k$, where $B_k$ is defined by (10.5-11). 
45 Verify (10.5-20) and (10.5-21). 


46 (a) Show that one cycle of the simple QR algorithm applied to a Hessenberg matrix 
results in a Hessenberg matrix and takes about 4n? multiplications and n — 1 square roots. 

(b) Show that in either the Hessenberg or tridiagonal cases a zero subdiagonal element 
allows decomposition of the problem into smaller ones. 

47 Let $A_k$ be a symmetric tridiagonal matrix. Show that $R_k$ given by (10.5-22) has nonzero 
elements only on the main diagonal and on the two diagonals above the main diagonal. Derive 
algorithms for the computation of the elements of $R_k$ and the elements of $A_{k+1}$. (These algorithms 
can be combined to form one compact algorithm for the QR transformation for symmetric 
tridiagonal matrices.) [Ref.: Ortega and Kaiser (1963).] 


48 Apply the QR transformation to the two tridiagonal matrices obtained in Prob. 33b. 

49 Apply the LR transformation directly to the matrices of Prob. 18, and compare these 
results with those of the previous problem. 

50 Verify that (10.5-25) and (10.5-26) hold. 


51 (a) What is the structure of the Householder matrix $P_j$ which causes the subdiagonal 
elements of column j of a matrix to vanish? 

(b) Show that pre- or postmultiplication of any matrix A by $P_j$, j > 1, leaves the first row 
and column of A unchanged. 

(c) Thus show that the matrix T such that TBT, for any matrix B, is in Hessenberg form 
has the structure of P,. 


Section 10.6 


52 (a) Assume that A has n distinct eigenvalues $\alpha_j$ and corresponding right and left 
eigenvectors $x_j$ and $y_j$, respectively, j = 1, ..., n. Show how (10.6-6) follows from the fact that the 
right and left eigenvectors of $A + \delta A$ corresponding to $\alpha_i + \delta\alpha_i$ can be written as 

\[ x_i + \sum_{\substack{j=1 \\ j \ne i}}^{n} \epsilon_{ij} x_j \quad \text{and} \quad y_i + \sum_{\substack{j=1 \\ j \ne i}}^{n} \eta_{ij} y_j \]

respectively, where the $\epsilon_{ij}$ and $\eta_{ij} \to 0$ as $\delta A \to 0$. 
(b) Prove that for any vectors x, y and any matrix A 

\[ |y^T A x| \le \|y\|_2 \, \|A\|_2 \, \|x\|_2 \]


53 (a) Let $\lambda$ and x be approximations to an eigenvalue and eigenvector of a symmetric 
matrix A, with x normalized so that $x^T x = 1$. If $r = Ax - \lambda x$, show that there exists an eigenvalue $\lambda_i$ of A such that 

\[ |\lambda_i - \lambda|^2 \le r^T r = \epsilon^2 \]

(b) By considering 

\[ A = \begin{bmatrix} a & \epsilon \\ \epsilon & a \end{bmatrix} \qquad \lambda = a \quad \text{and} \quad x^T = [1 \;\; 0] \]

show that no useful bound on the error in the eigenvector can be obtained. [Ref.: Wilkinson 
(1959c, 1964).] 

Computational efficiency, 336 
Condition: 
of a matrix, 431 
of a problem, 22, 23 
Consistency condition, 442 
Consistent method, 169, 177 
Continued fraction, (Prob. 5) 47, 292 
Convergence: 
essential, 523 
of numerical integration methods, 176, 181 
rate of, 333 
Convergence factor, 199 
Correctors, 186-189 
Adams-Moulton, 188 
Cost of computation, 337 
Cost coefficient, reduced, 459 
Cotes numbers, 119 
Cramer’s rule, 395 
Crout algorithm, 422 
Cubic spline, natural, 74, 77 


Danilevsky’s method, (Prob. 4) 541 
Davidenko’s method, 363 
Deflated polynomial, 372 
Deflation, implicit, 334 
Degeneracy: 
in rational approximations, 315 
in simplex method, 463 
$\delta^2$-process, 358-359, 446, 495 
Derivative-estimated iteration formulas, 
350-352 
Determinants, 465 
Diagonally dominant matrix, 432 
Difference equation, 174 
Difference operators, (Prob. 9) 82 
Differences, 57-61 
backward, 58 
central, 58 
divided, (Prob. 22) 85 
forward, 58 
table of, 59 
Differential correction algorithm, 318-320 
Differential equations: 
numerical solution of, 164-232 
(See also Numerical integration methods; 
Runge-Kutta methods ) 
Digit reversal, 266 
Discrete Fourier transform, 263 
Divided differences, (Prob. 22) 85 
Doolittle algorithm, 421 
Double precision arithmetic, 19 


Eberlein’s method, 520 
Economization: 
of power series, 308-309 


of rational functions, 309-311 
Efficiency index, 337 
Eigenvalues: 
of matrices: basic theorems on, 484 
bounds on, 486 
calculation of, 483-538 
errors in, 536-538 
power method for, 492-501 
location of, 486 
of nonsymmetric matrices: calculation of, 
513-521 
Jacobi-type methods, 520-521 
Lanczos’ method, 514-516 
supertriangularization, 516-519 
of symmetric matrices: calculation of, 
501-513 
Givens’ method, 506-511 
Householder’s method, 511-513 
Jacobi method, 501-506 
Eigenvectors, 483 
calculation of, 483-538 
errors in, 536-538 
generalized, (Prob. 8) 542 
of symmetric matrices, 501-513 
Elementary divisor, 473 
Elementary reflector, 452 
Equal ripple polynomial, 301 
Equilibration, 425-430 
for Crout algorithm, 427 
for Doolittle algorithm, 427 
Error: 
absolute, 4 
accumulated, 178, 179 
definitions of, 4 
in eigenvalues, 536-538 
in functional evaluation, 5 
probable, 11 
propagated, bounds and estimates for, 
182-183, 215 
relative, 4, 438 
roundoff, 4, 9-12 
( See also Roundoff error) 
sources of, 2 
truncation, 3 
( See also Truncation error) 
types of, 3 
Error analysis, 17, 20-22, 43-44 
backward, 21 
forward, 20 
roundoff, 9-12, 216, 434 
of solution of linear systems, 430-437 
Error bound, propagated, 182 
Error propagation, 166 
Euclidean matrix norm, 8 
Euclidean norm, 8 
Euler-Maclaurin sum formula, 136 
second, (Prob. 62) 161 
Euler transformation, 143-145 
Euler’s constant, (Prob. 64) 161 
Euler’s method, 178, 188 
Everett’s interpolation formula, (Prob. 14) 83 
Explicit integration formula, 167 
Exponential approximation, (Prob. 34) 87 
Extrapolation, 78 
active, 227 
passive, 227 
Extrapolation method, 226 
of numerical integration, 165 


Factorial functions, 141 
False position, method of, 338 
Fast Fourier transform, 263-270 
algorithm for, 265 
digit reversal, 266 
Sande-Tukey form, (Prob. 27) 281 
Fibonacci sequence, 341, (Prob. 3) 472 
Finite difference interpolation formulas, 61-63 
Finite differences, 57-63 
(See also Differences) 
Fixed-point arithmetic, 12 
Floating-point arithmetic, 15 
Floating-point numbers, 13 
Forward difference, 58 
Forward integration formula, 167 
Fourier functions, 33 
Fourier transform, discrete, 263 
Fourier’s conditions, (Prob. 5) 401 
Fraser diagram, 59 
Frobenius norm, 8 
Functional iteration, 334-337 
computational aspects, 336-337 
at a multiple root, 353-356 
Functional iteration methods, 335 
asymptotic error constant, 336 
computational efficiency, 336 
order of, 335 
stationary, 335 
Functionals, 42-46 


Gauss’ backward formula, 62 
Gauss-Chebyshev quadrature, 109 
Gauss’ formula, (Prob. 63) 161 
Gauss’ forward formula, 62 
Gauss-Hermite quadrature, 107 
Gauss-Jacobi quadrature, 108 
Gauss-Jordan reduction, 416 
Gauss-Laguerre quadrature, 106 
Gauss-Legendre quadrature, 101, 115 
Gauss-Seidel method, 444-447 
convergence of, 446 
Gaussian elimination, 415-418, 517 
compact forms for, 418-421 
error analysis for, 430-437 
matrix formulation, 418-421 
Gaussian quadrature, 98-101 
applied to singular integrals, 111-113 
in composite formulas, 113-117 
over infinite intervals, 105-108 
and orthogonal polynomials, 104-105 
weight functions in, 102-103 
Gaussian quadrature formulas, 101, 133, 134 
summary of, 114 
Gear’s method, 230-231 
Generalized eigenvector, (Prob. 8) 542 
Generating function, (Prob. 16) 49 
Gerschgorin’s theorem, 486 
Givens’ method, 506-511 
Graeffe’s method, 376-378 
Gram polynomials, 259 
Gram-Schmidt process, 256 
Gregory’s formula, 140 


Halley’s method, (Prob. 9) 401 
Hamming’s method, 194, 202-208 
Hardy’s rule, (Prob. 44) 158 
Harmonic number, (Prob. 16) 50 
Hermite interpolation, 70-73 
Hermite interpolation formula, 72, 98, 188, 224 
modified, 72 
Hermite polynomials, 107 
Hermitian matrix, 452 
Hessenberg matrix, 466, 513 
lower, 466 
upper, 466 
Hilbert matrix, 252 
Horner’s rule, 287, 291 
Householder transformation, 452, 453, 527, 533 
Householder’s method, 511-513 
Hurwitz-Routh criterion, (Prob. 16) 239 


Ill-condition, 412-413 
Ill-conditioned matrix, 252 
Ill-conditioned polynomials, 395 
Ill-conditioned problem, 22 
Illinois method, 341 
Implicit deflation, 334 
Implicit integration formula, 168 
Index of a rational function, 287 
Induced norm, 7 
Influence function, 43, 172 
Inherent error, 21 
Inner product, 18 
Instability, 22 
Interpolant, 66 
Interpolation, 52-79 
at equal intervals, 56-63 
finite difference, 61-63 
Hermite, 70-73 
inverse, 68, 334, 348-350 
near a Singularity, (Prob. 27) 86 
iterated, 66-68 
Lagrangian (see Lagrangian 
interpolation ) 
near a singularity, (Prob. 25) 86 
polynomial, 52-79 
rational, (Prob. 37) 88 
spline, 73-78 
trigonometric, 271-273 


using reciprocal differences, (Prob. 16) 325
Interpolation formulas: 
finite difference, 61-63 
use of, 63-66 
Interpolation series, (Prob. 19) 84 
Interpolatory approximation, 34 
Interval, changing the, 197 
Inverse interpolation, 68, 334, 348-350 
near a singularity, (Prob. 27) 86
Inverse iteration, 499 
Inverse power method, 498-501 
Iterated interpolation, 66-68 
Iterated vector, 493 
Iteration: 
convergence of, 184 
functional (see Functional iteration ) 
inverse, 499 
Iteration formulas: 
computational efficiency of, 346 
derivative-estimated, 350-352 
by direct interpolation, (Prob. 19) 403 
multipoint, 347-353 
one-point, 344-347 
order of, 344 
using general inverse interpolation, 
348-350 
Iteration function: 
n-point, 335 
( See also Functional iteration ) 
Iterative formula, 168 
Iterative process: 
acceleration of, 449-450 


roundoff error in, 447-449 
stationary, 443-450 
Iterative refinement, 437-440 


Jacobi iteration, 443-444 
Jacobi method, 501-506 
for nonsymmetric matrices, 520-521 
serial, 504 
threshold, 504 
Jacobi polynomials, 108 
Jacobian, 231-232 
Jenkins-Traub method, 383-391, 499 
Jordan canonical form, 492 


Knot, spline, 73 
Kronecker delta, 54 
Krylov’s method, 485 
Kummer’s method, 142 


Lagrangian interpolation, 53-57 
at equal intervals, 56-57 
error in, 55 
formula for, 55, 118, 197 
used to generate iteration formulas, 334 
Lagrangian interpolation polynomial, 55, 
249, 274 
Laguerre polynomials, 106 
Laguerre’s method, 380-383 
Lanczos’ method, 514-516 
Least-squares, principle of, 248 
Least squares approximation, 34, 247-274 
orthogonal polynomial, 254-260 
polynomial, 251-254 
using Fourier functions, 271-274 
Least squares problem, 454 
Left eigenvector, 483 
Legendre polynomial, 99 
Levenberg-Marquardt algorithm, 363
LeVerrier’s method, 485 
Linear approximation, 33 
Linear equations: 
overdetermined systems of, 451-457 
solution of simultaneous, 410-468 
basic theorem on, 411 
with complex coefficients, 465 
direct methods for, 414-430 
( See also Gaussian elimination ) 
error analysis of, 430-437 
iterative refinement, 437-440 


matrix iterative methods for, 440-442
roundoff error analysis, 434 
sources of error in, 414 
underdetermined systems of, 458 
Linear one-point matrix iteration, 441
Linear programming, 319 
simplex method for, 457-464 
basic solution for, 458 
degeneracy in, 463 
feasible solution for, 458 
standard form, 457 
tableau, 461 
unbounded solution of, 460 
Lozenge diagram, 59 
Lp norm, 7 
LR algorithm, 521-522 


Matrix: 
band, 466 
companion, (Prob. 11) 543 
complex, 465 
diagonally dominant, 432 
Hermitian, 452 
Hessenberg, 466, 513 
( See also Hessenberg matrix ) 
normal, 520 
orthogonal, 452 
plane rotation, 502 
triangular, 419 
tridiagonal, 466, 506 
Matrix inversion, 450-451 
Matrix iterative methods for solution of 
linear systems, 440-442 
linear one-point, 441
Matrix norm, 8 
Matrix powers, 498 
Maximally stable formula, (Prob. 28) 241 
Mehler quadrature, 108 
Midpoint method, 188 
modified, 226 
Midpoint rule, 120 
Milne-type estimate, 189 
Milne’s method, 186 
Minimax approximation, 34, 250, 286 
constructing, 315-320 
relative, 317 
Minimum maximum error techniques, 
285-320 
( See also Minimax approximation ) 
Muller’s method, 373, (Prob. 21) 403, 518 
Multipoint iteration formulas, 347-353 
Multistep method, 165 
Newton-based method for polynomials,
392-395 
Newton-Cotes formulas, 118-126, 131, 132, 
134, 170, 184, 187 
closed, 118 
composite, 121 
open, 120 
tables of, 120-121 
Newton interpolation formula, 56 
Newton-Raphson method, 230, 347, 355, 
360, (Prob. 29) 477 
for multiple roots, 354 
Newton’s backward formula, 62 
Newton’s divided-difference formula, (Prob. 
22) 85 
Newton’s forward formula, 61 
Newton’s identities, (Prob. 56) 408 
Newton’s method: 
damped, 361 
modified, 361 
quasi, 366 
( See also Newton-Raphson method ) 
Node, spline, 73 
Nonlinear equations: 
solution of, 333-397 
solution of systems of, 359-367 
Norm, 6-9, 431 
Euclidean, 8 
induced, 7 
matrix, 8 
spectral, 9 
subordinate, 7 
uniform, 7 
vector, 7 
weighted, 7 
Normal equations, 250 
solution of, 251-253 
Normal matrix, 520 
Normalized numbers, 14 
Null hypothesis, 254 
Numerical differentiation, 89-95 
of data, 89-92 
of functions, 93-95 
Numerical integration of data, 97-98 
Numerical integration methods, 
166-183 
based on higher derivatives, 224-226 
consistency of, 169 
convergence of, 176-178 
error estimation for, 189-191 
extrapolation-based, 226-228 
order of, 168 
self-starting, 195 
stability of, 173-182, 191-195 
truncation error in, 171-173 
Numerical quadrature, 96-145
(See also Gaussian quadrature; Newton- 
Cotes formulas; Quadrature formulas ) 
Nystrom’s method, 188 
Nystrom’s predictor, (Prob. 23) 240 


Obrechkoff’s method, (Prob. 3) 242 
One-point iteration formulas, 344-347 
Operator: 
difference, (Prob. 9) 82 
differentiation, (Prob. 3) 148 
shifting, (Prob. 9) 82 
Order: 
of accuracy, 55, 96 
of iteration, 338 
Ordinary differential equations, 164-232 
( See also Differential equations, numerical 
solution of) 
Orthogonal matrix, 452 
Orthogonal polynomials, 99, 104-105, 
254-260 
approximation with, 254-260 
and Gaussian quadrature, 104-105 
recurrence relations for, (Prob. 16) 151, 
256-260 
(See also Chebyshev polynomials; Gram
polynomials; Hermite polynomials; Jacobi
polynomials; Laguerre polynomials;
Legendre polynomial)
Overdetermined systems of linear equations, 
451-457 
Overflow, 13, 19 


Padé approximations, 293-299 
error in, 295 
use in economization of rational functions, 
309-311 
Parabolic rule, 1, 122, 127 
Parasitic solution, 175 
Parseval’s theorem, (Prob. 24) 281 
Peano kernel, 43, 172 
Peano’s theorem, 43 
Pegasus method, 341 
Perron-Frobenius theory, 446 
Picard’s method, 196 
Pivot element, 425 
Pivoting, 425-430 
complete, 425 
partial, 425 
Plane rotation, 527 
Plane rotation matrix, 502 
Point-slope method, 188 
Polynomial approximation, 34, 251 


Polynomial equations (see Zeros of polynomials)
Power method, 492-501 
acceleration of convergence of, 495-498 
inverse, 498-501
Predictor-corrector methods, 183-195 
compared with Runge-Kutta methods, 
223-224 
convergence of iterations in, 184-185 
error estimation, 189-191
stability of, 191-195 
use of, 198-200, 202-208 
Predictors, 185-187, 224 
Adams-Bashforth, 187 
Hermite, 188 
Probability density, 11, (Prob. 10) 26 
Probability distribution, (Prob. 10) 26 
Probable error, 11 
Prony’s method, (Prob. 34) 87, (Prob. 35) 
88 
Propagation of error, 166 
in Runge-Kutta methods, 215 
Purifying zeros, 372 


QR algorithm, 521-536 
double, 529-536 
simple, 526-529 
Quadratic factor algorithm, 287-291 
Quadrature: 
Gaussian (see Gaussian quadrature ) 
numerical, 96-145 
( See also Gaussian quadrature; Newton- 
Cotes formulas; Quadrature formulas ) 
Quadrature formulas: 
choosing, 130-135 
composite, 113-117 
Gaussian, 98-117 
Newton-Cotes, 118-126 
Quasi-Newton method, 366 


Rational approximations, 34, 286-320 
(See also Minimax approximation; Padé 
approximations ) 
Rational functions: 
index of, 287 
summation of, 141-143 
Rational interpolation, (Prob. 37) 88 
Rayleigh quotient, 497 
Reciprocal derivative, (Prob. 17) 325 
Reciprocal difference, (Prob. 14) 325 
Recurrence relation, (Prob. 16) 151, 
256-260 
Regula falsi, 338 


Relative error, 4, 179, 438 
Relative stability, 180 
Remes’ second algorithm, 315 
Residual, 248, 413 
Richardson extrapolation, 94, 123, 227 
Riemann sum, 117, 252 
Right eigenvector, 483 
Robustness, 130 
Rodrigues formula, (Prob. 13) 151 
Romberg integration, 123-126, 132, 133 
Root-squaring, 376-378 
Roundoff error, 4, 5, 9-12 
in iterative methods, 447-449 
probabilistic approach, 9-12 
in Runge-Kutta methods, 216 
in solution of linear systems, 433-437 
in solution of nonlinear equations, 334 
Runge effect, 66 
Runge-Kutta methods, 209-224 
compared to predictor-corrector methods, 
223 
error estimation, 219 
fourth-order, 217 
higher-order, 218 
propagated errors in, 215 
roundoff error in, 216 
second-order, 216 
stability, 221
third-order, 216 
truncation error in, 213 


Sampling theorem, (Prob. 38) 284
Sande-Tukey algorithm, (Prob. 27) 281
Scaling, 13
Schur’s theorem, (Prob. 17) 239
Schwarz constant, 493
Schwarz inequality, 25
Secant method, 338-344
efficiency index for, 342
Second Euler-Maclaurin sum formula, (Prob. 62) 161
Self-starting method, 195
Serial Jacobi method, 504
Shifting operator, (Prob. 9) 82
Significant digit, 5
Simplex method, 457-464
(See also Linear programming, simplex method for)
Simpson’s rule, 120, 127, 186
Simultaneous linear equations (see Linear equations, solution of simultaneous)
Single step method, 165
Singular integrals, 111-113
Slack variable, 458
Spectral condition number, 431
Spectral norm, 9 
Spectral radius, 8, 487 
Spline, 73 
knot of, 73 
node of, 73 
Spline interpolation, 73-78
Square root calculation, 31, (Prob. 10) 48, 
(Prob. 33) 329 
Stability, 22 
of numerical integration methods, 173, 176 
of predictor-corrector methods, 191 
relative, 180 
of Runge-Kutta methods, 221 
Stationary iteration, 335, 440 
Stationary iterative process, 443-450 
Steady-state solution, 229 
Steepest descent, method of, 362 
Steffensen’s interpolation formula, (Prob. 14) 83
Stein-Rosenberg theorem, 446
Stiff equations, 228-232
Stiffness, 229
Stiffness ratio, 230
Stirling numbers of the second kind, (Prob. 67) 162
Stirling’s interpolation formula, 63
Sturm sequence, 368-371, 508
generalized, (Prob. 45) 406
Subordinate norm, 7
Subtractive cancellation, 18
Successive approximations, method of, 196
Successive overrelaxation, 449
Summation, 136-145
by parts, (Prob. 11) 278
of rational functions, 141-143
Supertriangular matrix (see Hessenberg matrix)
Supertriangularization, 516-519
Suppression, method of, 334
Synthetic division algorithm, 371


t-method, (Prob. 35) 329
Tabular point, 53
Taylor-series methods, 166, 195
Thiele’s interpolation formula, (Prob. 16) 325, (Prob. 19) 326
Thiele’s theorem, (Prob. 17) 326
Threshold constant, (Prob. 24) 544
Threshold Jacobi method, 504, (Prob. 24) 544
Toeplitz convergence, 125
Transient solution, 229
Trapezoidal rule, 1, 120, 124, 230
Triangular matrix, 419
Tridiagonal matrix, 75, 466, 506 
Trigonometric interpolation, 271 
Truncation error, 3 
local, 166 
in numerical integration, 171-173 
in Runge-Kutta methods, 213-215 


Underdetermined system of linear equations, 458
Underflow, 19
Undetermined coefficients, method of, 44-46, 168
Uniform norm, 7, 286


Variable-order method, 188
Variable-order, variable-step methods, 
201-202 


Von Mises’ theorem, 493 


Weddle’s rule, (Prob. 44) 157 
Weierstrass theorem, 36-38, 134 
Weight functions, 102-103, 248 
Wilkinson’s method, 496 


Zeros of polynomials:
computation of, 367-397
classical methods, 371-383
effect of coefficient errors, 395-397
location of, 368
purifying, 372
(See also Bairstow’s method; Bernoulli’s method; Jenkins-Traub method;
Laguerre’s method; Newton-based method for polynomials; Root-squaring)
CHAPTER 1


1. 


(a) 2.05265, 2.05375; (b) 10.7015, 10.7125; (c) 18.74486, 18.77916; 
(d) 131.3565, 133.4410. 
(a) .00005; (b) 6.493. 

n 

a-1 

(a) rarest (c) aey* °. 
(a) Product: E = .01715; same as results from 1c. 

Quotient: E ~ 1.0421; from 1d, 1.0504. 
(a) E = .000026; E = .0057; (b) (i) E ~ -.00208; (ii) E ~ .00013; 
(iii) E = .00068; (iv) E = .04262. 
(a) Consider discriminant of $\sum_i (u_i - x v_i)^2$. (c) Use scalar triangle
inequality for N3 for both norms. (d) For $t>0$, $0<m<1$ show that
$t^m \le 1 + m(t-1)$; let $t = x/y$ and show that $x^{1/p} y^{1/q} \le x/p + y/q$; then
using $x_i = |u_i|^p / \sum_{j=1}^n |u_j|^p$ and $y_i = |v_i|^q / \sum_{j=1}^n |v_j|^q$ consider
$\sum_{i=1}^n x_i^{1/p} y_i^{1/q}$ in this inequality. (e) Use triangle inequality on
$\sum_{i=1}^n |u_i+v_i|^p = \sum_{i=1}^n |u_i+v_i|\,|u_i+v_i|^{p-1}$; then apply Hölder's inequality
to u and vector with components $|u_i+v_i|^{p-1}$ and similarly to v and
this vector. (f) Weights factor out in all cases. (g) Let
$|v_q| = \max_{1 \le i \le n} |v_i|$ and then factor qth term out of expression on left hand
side.
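The chain of steps in part (d) can be written out as follows. This is only a sketch of the standard argument; the identification $m = 1/p$ with $1/p + 1/q = 1$ is an assumption, since the problem statement is not reproduced in this section.

```latex
% Step 1: for t > 0 and 0 < m < 1
t^{m} \le 1 + m(t-1).
% Step 2: put t = x/y and m = 1/p (assumed), so that 1 - m = 1/q; multiply by y
x^{1/p}\, y^{1/q} \le \frac{x}{p} + \frac{y}{q}.
% Step 3: with x_i = |u_i|^p / \sum_j |u_j|^p and y_i = |v_i|^q / \sum_j |v_j|^q, sum over i
\sum_{i=1}^{n} x_i^{1/p} y_i^{1/q}
  = \frac{\sum_{i=1}^{n} |u_i|\,|v_i|}
         {\bigl(\sum_{j} |u_j|^p\bigr)^{1/p} \bigl(\sum_{j} |v_j|^q\bigr)^{1/q}}
  \le \frac{1}{p} + \frac{1}{q} = 1 .
```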
(a) Use scalar triangle inequality for N3; for N4 use
$\|A\| = \max_{x \ne 0} \|Ax\|/\|x\|$: for x which achieves maximum for AB, get
$\|AB\| = \|ABx\|/\|x\| = \|A(Bx)\|/\|x\| \le \|A\|\,\|Bx\|/\|x\|$.
(d) $x_j = \operatorname{sign}(a_{rj}) \cdot 1$ with r as in part c. (e) Show $\|A\| \le \max_j \sum_{i=1}^n |a_{ij}|$
and choose $x = (\delta_{ic})$ where c is value of j which maximizes
sum. (f) $\|a+b\|^2 = \|a\|^2 + \|b\|^2 + 2\sum\sum a_{ij} b_{ij}$; then use Cauchy-Schwarz
inequality to show sum less than $\|a\|\,\|b\|$. (g) Show $|(ab)_{ij}|^2 \le \sum_{r,s} |a_{ir} b_{rj}|\,|a_{is} b_{sj}|$,
then use $0 \le |x \pm y|^2$ to get result.

(a) Use | [Ax] ]=|[xl] [[Ax/IIxl] TI. (b) | For xl =1, ; 

LIAL loops t Axl Ist IAL leon: (c) Flay 4X5) < a3 ) (Ex) < 


2 


(EZay 4°) (Ex, ). (a) By direct calculation of tr (A’A). (e) Sum 


of eigenvalues of A‘A <n o(ATA) = n[ lal |. 


Let x be eigenvector corresponding to eigenvalue of maximum magni- 
tude. 
(b) Show that relevant region in Eye £5 plane is triangular; (c) 
use change of variable s+u-t. 

2 
(b) ¥ - 3/2x#9/8 on (1/2,3/2]; 3/4-x* on [0,1/2). 




13. 


14, 


15. 


16. 


17. 


19. 


20. 


21. 


22. 


23. 


24, 


(b) n > 16, Pr. 2.1 * 10°. 


n 
(b) 6, < 1/2 x 1079; (c) ©, = ae,_, + 6.3 (£) © = J a 565 ana use 
1 1 n 5 


1 i-1 =0 
(1.4-7). 
1 2 2 n 
Minimize y as +A om a, - a). 
i=1 i=1 


On a binary computer bounds on ¢€ are: 0 for addition and subtract- 


ion, 2-P7! for multiplication and division. 

Since 24 bits are needed for the integer part, only 3 bits left for
.0, .1, ..., .9; therefore, all numbers with .2, .3 expressed by last
three bits of 010, all with .7, .8 by 110; in general, finite combinations
in real number system can't express all real numbers.

(a) Without 2p+1 bit accumulator, shifting, adding, normalizing and 


then rounding will give .1000 x2? instead of correct .1111x2°, 
(b) No effect since 2Prtm is an integer. (c) Without replacement 


| m, +m l>y after shift so rounding bit after normalization is in p+t1 


or p+2 bit before normalization; effect of replacement is only to 
zero out all bits of mm, from p+3 on. (da) With a pt2 bit 


accumulator replacement of part b requires addition of 1 in p+t2 
place if numbers have opposite signs and any 1 bits are shifted out 


the right. (e) Consider .1000x27~.1001x2? with a rule like that in 


part b and a five bit accumulator. 
(a) At most one roundoff when smaller is shifted right; use (1.5- 
19) with e€ moved inside first parentheses. (b) First roundoff 


halved when right shift occurs. (c) [xty|>2?-1, just add two 


bounds from part b. 

(a) Cancellation may result in no rounding of numbers added in 
order other than increasing. (b) No rounding if added in order 
shown; rounding at first two stages if added in order of increasing
magnitude.


_ . _ -p 
(b) ti=x y (146) si =(s,_,tt.) (Ite), [6], le |<2 ©. 


(c) s = Ex,y,; (+n, ) where (Ven J=(14+6) (Tte)... (Ite), 


n 
n-r+2 
v 


Y=2,.2-,N3 ny=no! (1-27 Py norte < 1+n_ < (142 P) r=2,.e-,Nn3 


for double precision accumulation before rounding to single precis- 


ion (1-(3/2)277PyP-FF? << den x (14(3/2)277P)P TH, re2,... 40: 
N4=N>5- 
'? q-1 -p 20-1 -p -2971 
(a) With d=2 largest is (1-2 *)2 , smallest is 2 *2 ‘ 
(b) Similar examples to those in Sect. 1.5-4 possible for all 

operations. 

(a) To add (a, ras) to (by +b.) : (i) C,=a,+b,; (G2) Co=a,tbo; 
(iii) Overflow?, Yes*+(iv), NO*END; (iv) c,=c,+10 . 

(b) (i) c,=a,tb,; (ii) cy=a5tbo; (iii) Overflow?, Yes~*(iv), No+ 


(vil); (iv)Positive?, Yes*(v), No*(vi); (v) c,=c,+10°%, Go to 


y=e,- 1079; (vii) Signs of Cy 0C5 same?, Yes*END, No 
(viii); (viii) c, positive?, Yes+(ix), No+(xi); (ix) c,=c,-10°9; 
a 


=Cotl/2+1/2, Go to END; (xi) c,=C,+10— > (xii) cy=c,-1/2- 


(vii); (vi) c 


(x) c 
1/2. 


2 




25. 


26. 
27. 


28. 


29. 


_ -p 
£2 (x4+xX5X34)= (x4 +x5x,) (1te) where |e|< 2 (14x5x3/ (x, +x) ) for p 
bit mantissas. 


- ~97P 
Ph XyXqee eX, (IFE) where (1-2 *) 
(a) Ill-conditioned because small changes in coefficients, for 


example those in part b, lead to large changes in solution x=y=1. 

(b) x=10, y=-2. (c) Lines defined by 2 equations in part a or b 

are nearly parallel; therefore small change in line leads to large 
change in point of intersection.

(a) af-b. (b) Precisely the same with y identified with YY 


n-1 


MV ettes (1427P) NTT, 


identified with Yo: 
In general stability is good since each iteration effectively
restarts entire computation.




CHAPTER 2 


1. 


10. 


11. 


13. 


14. 


15. 
16. 


17. 


18. 


19. 
20. 


21. 
22. 


(a) $P(x)=x$; max error $\pi/2-1$ at $x=\pi/2$; (b) $P(x)=(1/\pi^3)(96-24\pi)x +
(1/\pi^2)(8\pi-24)$; max error .158 at $x=\pi/2$; (c) $P(x)=(1/\pi^4)(384\pi-1152)x
+(1/\pi^3)(384-120\pi)$; max error .226 at $x=0$.
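The coefficients quoted in part (b) coincide with the least-squares straight line for sin x on [0, π/2]. That reading is an inference from the numbers, not something stated in this section, so the sketch below should be taken as a check under that assumption; the function names are illustrative.

```python
import math

def least_squares_line_for_sin():
    """Least-squares line a + b*x for sin x on [0, pi/2] (assumed problem setting)."""
    L = math.pi / 2
    # Moments of 1, x, x^2 on [0, L]
    m0, m1, m2 = L, L**2 / 2, L**3 / 3
    # Integrals of sin x and x*sin x on [0, L]; both equal 1
    r0, r1 = 1.0, 1.0
    det = m0 * m2 - m1 * m1
    a = (r0 * m2 - r1 * m1) / det
    b = (m0 * r1 - m1 * r0) / det
    return a, b

def max_error(a, b, samples=2001):
    """Largest |sin x - (a + b*x)| on a uniform grid over [0, pi/2], and where it occurs."""
    L = math.pi / 2
    grid = [L * i / (samples - 1) for i in range(samples)]
    return max((abs(math.sin(x) - a - b * x), x) for x in grid)

a, b = least_squares_line_for_sin()
err, where = max_error(a, b)
```

Under this assumption the slope and intercept agree with the closed forms in part (b), and the largest error occurs at the right endpoint.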
(a) Suppose there is only one zero; show rotation of line reduces 
error. 


(a) Use induction; (b) n=1: 3t/(34+t) ; max error .0354 at x=1.0; 


n=2: (15t+t?)/(15+6t7); max error .0813 at x=1.0. 
a=-1, b = -2, c = -2; max error .2817 at x=1.0. 
(a) [El<x’/7!; (c) 3% 107'972 410729 
(a) 6; (b) 9. 

(a) P, (x)= (x-a,)/(a,-a5): Po (x)= (x-a,)/(ag-a,)? independence means 
approximation found once and for all;  (b) x7=(a,+a,)x-a,a5! no 
dependence on a and b because approximation exact for linear 
polynomials. 


/4; (a) 3 10°'9; (e) ves. 


-a; (c) show that Ya - x = (1/2x, ) 


(b) Consider x -1 and x n+1 


> n+1 n+1 
(fa - x) . 
(a) y=(x-a)/(b-a); (b) By (x) =2x/; By (x)=4(1 - 13) x7 /0 742 V2x/T; 


B, (x) =8x7/n +6 3 (1-2x/m) x7 /m 743% (1-2/0) 2/m; (Cc) 432. 


(a) Note that $(t) and p(t) are even, continuous functions of 
period 2%; (b) for V(t) consider o, (t)=[F(t - 0/2)+F (-t-1/2))/2 


and w, (t)={ (F(t) - w/2)+F(-t-1/2))]sin t}/2; (c) consider F(t)sin 


t and F(t)cos*t. 

(a) Since F(t)=@(t) and w(t)=0, part a of the previous problem 
serves as the whole proof; (b) F(t)=(1/2) (1-cos t) in both cases; 
plot the error to see why any cos 2t term would increase the 
maximum error. 


n=2: 1/4 (3-x7); n=3: 1/10(8-3x7); n=4: 1/80 (3x'-30x2+67) . 
(a) Final value of m is largest $X_i$; since j's decrease, value
with largest j is reached first; formal proof by induction.
(b) Consider $X_2, \ldots, X_n$ and probability that $X_1$ is the largest.
(c) Using initial conditions $p_{1k} = \delta_{0k}$, $p_{nk} = 0$ if $k<0$ and substituting
$G_n(z)$ into equation of part b, get $G_n(z) = (z+n-1)G_{n-1}(z)/n$.
(d) $G_n'(1) = \sum_k k\,p_{nk}$; then differentiate $G_n(z)$.


(a) mm and m always contain two largest values found; statement 
labeled L assures that largest in m, next in m'; formal proof by 
induction. (b) Let q nk7Prob (k executions); equation for Fak 


same as for Pak in previous problem; 9974947 1/2: same result as 
prob 16d. (c) Let Tak Prob (k executions) ; G (2) = (2z+n-2) 
' 1 
4 (2) /n; G, (1) #2 (H- 1-35) . 
(a) All m and n. (b) Final value of d divides final c and, therefore,
all previous c and d from equation r=c-qd; similarly any
divisor of m and n divides each c and d. (c) Add to first step
a', b ← 1; a, b' ← 0; add to fourth step t ← a', a' ← a, a ← t-qa, t ← b', b' ← b,
b ← t-qb.
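The bookkeeping added in part (c) can be sketched in Python as below. The function name and loop layout are illustrative assumptions; only the assignments to a, a', b, b' and the relation r = c - qd come from the hint.

```python
def extended_euclid(m, n):
    """Return (d, a, b) with a*m + b*n == d == gcd(m, n), for positive integers m, n."""
    # First step with the additions of part (c): a', b <- 1 and a, b' <- 0.
    a_prev, a = 1, 0
    b_prev, b = 0, 1
    c, d = m, n
    while True:
        q, r = divmod(c, d)          # r = c - q*d, as in part (b)
        if r == 0:
            return d, a, b
        c, d = d, r
        # Additions to the fourth step: t <- a', a' <- a, a <- t - q*a (and likewise for b).
        a_prev, a = a, a_prev - q * a
        b_prev, b = b, b_prev - q * b

d, a, b = extended_euclid(1769, 551)
```

Throughout the loop a'·m + b'·n = c and a·m + b·n = d hold, which is why the returned a and b express the greatest common divisor as a combination of m and n.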
(a) It will converge but perhaps to a discontinuity, not a zero. 
(a) 2.0918 after 9 iterations; true result 2.0946. 
(a) |1-f'(x)|<1. (b) Iteration diverges. (c) Iteration converges
to 2.0946 after 6 iterations.
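A small experiment illustrates the criterion in part (a). The equation x³ - 2x - 5 = 0 is an assumption, chosen only because its real root (about 2.0946) matches the value quoted; Newton's method stands in here for whichever convergent variant part (c) intends, so iteration counts are not expected to match.

```python
def f(x):
    return x**3 - 2 * x - 5      # assumed test equation, real root near 2.0946

def df(x):
    return 3 * x**2 - 2

def simple_iteration(x, steps):
    """Iterate x <- x - f(x); returns None once the iterates blow up."""
    for _ in range(steps):
        x = x - f(x)
        if abs(x) > 1e6:
            return None          # treated as divergence
    return x

def newton(x, steps):
    """Iterate x <- x - f(x)/f'(x)."""
    for _ in range(steps):
        x = x - f(x) / df(x)
    return x

diverged = simple_iteration(2.0, 20) is None
root = newton(2.0, 6)
criterion_holds = abs(1 - df(root)) < 1
```

Near the root f'(x) is roughly 11, so |1 - f'(x)| is far larger than 1 and the plain iteration cannot converge, whereas the Newton iterates settle quickly.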


Only the last is not. 
(a) For this functional n=2; therefore, K (t) = 


2 


23. 


24. 


25. 


26. 


[-(x,-t)? +3(x,-t)? -3 (x5-t)? + (x4-t) 21. (b) E(x?)=-18h?, 


n=3; K (t) == (1/72) (1=t) 2 (3t+#1) for 0<t<1 and =-K(t) for -1<t<0; K(t) 
of constant sign on [-1,1]; E(x) =-4/15; E(£)=-(1/90)£°(E). 

For (2.4-4), K (t)={ (atb) /2-t] 7- (1/4) (b-t) 2 on [0, (atb)/2] and 

~ (1/4) (b-t)? on [(atb)/2,b]. 

For f{(atb)/2): Wig oo™ 1/2: ; Wa Fo y= (1/8) (b-a; for f° £ (x) dx: 
Wao Woon (b-a) /2? W445 7Wo4= (b-a) /12. ; 

(a) Wo 9=b-a-W Gg? Wo4™ (b-a) wy /2- (1/6) (b-a) ? Wo 4% (b-a)w, /2-) 1/3) 


(b-a) °, (b) Same solution as in prob. 25; minimized sum of 
squares of coefficients tends to minimize roundoff. 




CHAPTER 3 


1. (a) Just use fact that formula is exact for xX, k<n; (b) jet: 
jJ=2: 1 ata j=3: 1 at ay? for small values of n roundoff 


not serious. 
da 


2. (a) Extrema of p(x) at a,th/V3;ho£''* (x) <9 ¥3 x 10 -; 
(b) 2120 x 107971.14;  (c) n=3: sin x: h<1.15*x1073; e*: he. 27x1077; 


sin 100x: h<1.15x107>; n=5: sin x: h<1077; e*: h<1.25x1077: 
u 


sin 100x: h<10.. 


.  (b) (23x? = 63x% - 234% + 324)/162. 
4. (a) Example 3.1: .564634; Example 3.2: .514136; (b) Example 3.1:
.540375; Example 3.2: .495133.


5. (a) (4) [35"[<.377 |B] <4.7%107"; (ii) [Opt |<.15; JE]<107>; with 


roundoff error in linear interpolation error could be greater than 


5x10; therefore, use three-point formula. (b) .1951, .1384, .0828,
.0288, -.0232, -.0729, -.1200, -.1641, -.2051, -.2426. (c) Can
use linear interpolation; results: .5725, .5622, .5480, .5300,
.5087, .4840, .4562, .4257, .3925, .3572. (d) Six.

6. (a) .1837, .5712, .3683; (b) -.0001, .5191, .4318; (c) -.1154,
.4595, .4635; (d) -.2280, .3719, .4837.

7. Use induction and (3.3-5). 

8. Use (3.3-5). 


9. $\Delta = (1-\nabla)^{-1} - 1$; $\delta = \Delta(\Delta+1)^{-1/2} = \nabla E^{1/2}$.


10. (b) For any closed path by replacing portions of the path that are 
three sides of a rhombus by the fourth side and portions that are 
two-sided by the other two sides until a single rhombus is left.

(d) Using results of parts b and c any such path can be deformed 
into any other without introducing any new contribution. 

11. (a) Use an interpolation formula with m = 9 for each entry. 

(b) If series converges, then remainder must go to zero. (c) Use 
(3.3-8) for Arf and then (3.3-11) for £(x+keh). (d) $\Delta_1 = E^p - 1
= (1+\Delta)^p - 1 = p\Delta(1+(p-1)(\Delta/2)\{1+(p-2)(\Delta/3)[1+(p-3)\Delta/4]\})$;
$\Delta_1^2 = p^2\Delta^2\{1+(p-1)\Delta[1+(7p-11)\Delta/12]\}$; $\Delta_1^3 = p^3\Delta^3[1+3(p-1)\Delta/2]$;
$\Delta_1^4 = p^4\Delta^4$; (e) p=1/2: .1952, .1383, .0827, .0287, -.0233, -.0729,
-.1200, -.1641, -.2051, -.2427; p=1/3: .2047, .1477, .0919, .0376,
-.0148, -.0648, -.1123, -.1570, -.1985, -.2386; p=-1/3: .1856,
.1291, .0737, .0200, -.0317, -.0809, -.1275, -.1711, -.2116,
-.2486; near the end of the table backward differences would be
better.


12. (a) (i) Use binomial theorem; (b) replace 3) 2) by Gob 


r 


13. (a) Consider aE '; result is y (m)=E (m+j-1) a7 £4. (b) For Gauss's 


forward consider ace, result is y (m)=E (m+j-1) 958°? £, + 


2j+1 
23+19 


y(m)=E (m+j) 958° £5+(m+}) 95,46 


(m+3j) £472! for Gauss's backward result is 
aa cere (c) Stirling: 2kth term: 


m>(m?=1). 2. [m= (k-1) 716K Ee / (2k) 1; (2k+1)st term: m*(m*-1)... 


2k+1 


(m2-k7) 16 £y/(2k+1) 1b; Bessel: 2kth term: m(m2—1) wee tm2-(k-1) 7} 


k 


(m-k) ys fy )o/(2k)t; (2k+1)st term: m(m2-1)...[m-(k-1) 2] (m-k) 


2k+1 


(m-1/2)6 fy s2/ (2k+1)! 




15. 


16. 


17. 


18. 


19. 


20. 
21. 


22. 


23. 


24, 
25. 


26. 
27. 


28. 


29. 


(a) Min at m=1/2, max at m=0, 1; (b) B,-CBo=f (m)=(m(m-1) /2] 
{ (m+1) (m-2) /12-c] ; max at m,=1/2, min at m., 37 1/2 + (5/4+6c) '72; 


set f(m,)=-f(m, 3); get 5760°+144c+7=0; one root c=-.18394 is near 
average value of B,/B. which is -.1805. 

(a) Newton forward: Analogous to (3.3—4)h™m(m=1) 2... (m-k) £ OKT”) (gy 
/k!; (b) Stirling ended with 2k difference: 

n2kt 1 (m2—1) 02. (m2-k2) £ (2841) (£) 7 (2k41) 1; Bessel ended with 2k 
difference: {h7*t'm(m?-1)... (m2 (k-1) 2} (m-k) / (2k+1) 1} { (mek) £ (2K+1) 


(E+ (m-k-1) €PKF7) (ey, 
(a) Difference Newton Gauss 
-9,.8,.7 ~-6,.7,-9 
5 -9,.8,.7,.6,.5,.4 06507725708 eh, D9 
(b) Error should be between -8.0x10-> and -~3.7«107?; calculated 
6 


error is -7.0x10 -. 
(a) (i) Through second difference; (ii) through seventh differ- 
ence. (b) In both cases coefficients of differences are small. 
(a) Use m= (x-a))/h in (3.3-11); (b) use induction; (da) use 


ratio test; (e) asymptotic even when ether because of roundoff. 


(a) 3; (b) 3; (c) 4.

(a) Newton forward, m=-.3; .1837, .5712, .3683. (b) Stirling,
m=.05; -.0001, .5191, .4318. (c) Bessel, m=.4; -.1154, .4594,
.4635. (d) Newton backward, m=.1; -.2280, .3719, .4838.

(a) Use induction. 


(b)  .40   -.916291
                      2.23144
     .50   -.693147              -1.8303
                      1.68236                1.684
     .70   -.356675              -1.1568
                      1.33531
     .80   -.223144


(c) By repeated application of definition in part a; (d) use
first result of part c; (e) result follows since $E(a_i)=0$, $i=1,\ldots,n$;
(f) as $a_i \to a_1$, $f[a_1,\ldots,a_k] \to f^{(k-1)}(a_1)/(k-1)!$; (g) -.509977;
(h) use induction.
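The divided-difference table of part (b) can be regenerated from its first two columns with a short routine. The sketch below is illustrative: the function names are made up here, and evaluating the Newton polynomial at x = .60 is an assumption that happens to be consistent with the value in part (g).

```python
def divided_differences(xs, fs):
    """Return table columns; columns[k][i] is the divided difference f[x_i, ..., x_{i+k}]."""
    columns = [list(fs)]
    for k in range(1, len(xs)):
        prev = columns[-1]
        col = [(prev[i + 1] - prev[i]) / (xs[i + k] - xs[i])
               for i in range(len(prev) - 1)]
        columns.append(col)
    return columns

def newton_eval(xs, columns, x):
    """Evaluate Newton's divided-difference polynomial using the top entry of each column."""
    result, product = 0.0, 1.0
    for k, col in enumerate(columns):
        result += col[0] * product
        product *= x - xs[k]
    return result

xs = [0.40, 0.50, 0.70, 0.80]
fs = [-0.916291, -0.693147, -0.356675, -0.223144]
table = divided_differences(xs, fs)
value = newton_eval(xs, table, 0.60)   # assumed evaluation point
```

Each column is built from the previous one, so unequal spacing (here .1, .2, .1) needs no special handling.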

(a) Use (3.5-1) as in Sec. 3.5; additional entries: anti Xs 
Yn+1 (x), Yn,n+1 (xX) poco, Y1,2,..,n+1'*) > (b) —-.616144, 


(a) .1837, .5712, .3683; (b) -.0001, .5191, .4318. 
(a) 806.16264; (b) 771.42567; true value is 771.41025; large
derivatives of tan x near $\pi/2$ cause of large error in part a.


2.4048; true value 2.4048.
(a) x=1.5677; (b) sin x=.9999950; true value of x is 1.5676; 
if f£' (x)#0 on [x, -x,] process will converge for any f(x). 


Set t.(x)=mx+n and use (3.7-7); proceed similarly for s(x) using
(3.7-9).
Use J 4 (x)RTQ Cx)? results: .1838, -.0001, -.1154, -.2280. 




h, -h 6h, (o,-9,) 
, = 32 4. = 3193" °2 


2 1 2 7 
hz +h, (ha + 5 h,) (ho+h,) 
hy-17Py _ Shy 4 (F,79,-1) 
a ns ae | a er 
Aha * 2p (hyiy + x AQ) (hy thy) 
b n-1 F541 
(b) f S"(x) [£"(x)-S"(x)]dx = }  S"(x) [£' (x)-S' (x) ] 
a i=1 
a. 
i 
n-1 Asay 
-j) f S''' (x) [£' (x)-S' (x) ]dx - S'''(x) is constant on 
i=1 a4, 
i 
fay ayy) 
x 
(a) Let G(x) = f [f£'(t)-S'(t)}]dt. Then G(a;)=G(a,,,)=0. 
ai 
x 2 X 5 x 3 
(b) ( f tefe''(t)-S''(t)Jat)* < f 1% at + f [£'(t)-S'' (t))} “at 
Zz z Zz 
x 
(c) By problem 31, E(f) > E(f£-S)£(x)-S(x) = ff [£'(t)-S'(t)]dt 
ai 
(a) M, = -2.0286 M, = -1.4627 M, = -1.0333 My = -.8058 
Me = -.6546. 
(b) M, = -1.5119 M, = -1.0138 My = -.8399. 
(c) Three values of integral: .1735, .1735, .1735. 
Three values of S(.35): .5917, .5916, .5916.
Three values of S'(.35): .8428, .8455, .8452.
Dix, b.h 
Cc. = a. , . = 
(a) Cc a. e us e 


(b) Multiply the equation g(k+j) = Ye bv qd. » k=0,...,n and 
sum for j=0,...,n-1 yielding the system 


Yne5 + dau, Yntj-1 tere t d, ¥, =0 . j=0-...n-1. 


(ad) If uy=e"®, ©, B real, p > 0, iS a root of Pr i%)s so is 


_ 718 x x _z _ 


x 
p (A, cos B x + A, 


f(x) has an oscillating component. 


- 96x _ _aoue 7° 18% 


Sin B x), A, A, real, i.e. the approximation 


~554e_ 
-1.29x 
e 


f (x) 


f(x) = (1.43 sin .515x - .35 cos .515x). 




36. 


37. 


(b) (i) Horizontal: .8034, .8303, .8566, .8823; vertical: .8451. 
(ii) Vertical: .8566, .8478, .8392, .8310; horizontal: .8448; 
for diagonal points just interpolate in single variable along the 
diagonal. 


(b) If Ran (*) = Pin (I/O (X) and Ran %5) = Y5" j=1,...,8, then 


P(X) Q, () - Pn (*) Q, x) = 0. 


(c) Ry, (x) 


3. 
(a) Ria(x) = =! 
11 x-71 ° 
CHAPTER 4


1. n=3. $x=a_1$: $w_1^{(1)} = -3/2h$, $w_2^{(1)} = 2/h$, $w_3^{(1)} = -1/2h$, $w_1^{(2)} = w_3^{(2)} = 1/h^2$,
$w_2^{(2)} = -2/h^2$.
$x=a_2$: $-w_1^{(1)} = w_3^{(1)} = 1/2h$, $w_2^{(1)} = 0$. The $w_j^{(2)}$ for $a_2$ and $a_3$
are the same as those for $a_1$. The $w_j^{(1)}$ for $x=a_3$ are those
for $x=a_1$ in reverse order.

n=5. $x=a_1$: $w_1^{(1)} = -25/12h$, $w_2^{(1)} = 4/h$, $w_3^{(1)} = -3/h$, $w_4^{(1)} = 4/3h$,
$w_5^{(1)} = -1/4h$;
$w_1^{(2)} = 35/12h^2$, $w_2^{(2)} = -26/3h^2$, $w_3^{(2)} = 19/2h^2$, $w_4^{(2)} = -14/3h^2$,
$w_5^{(2)} = 11/12h^2$.
$x=a_2$: $w_1^{(1)} = -1/4h$, $w_2^{(1)} = -5/6h$, $w_3^{(1)} = 3/2h$, $w_4^{(1)} = -1/2h$,
$w_5^{(1)} = 1/12h$;
$w_1^{(2)} = 11/12h^2$, $w_2^{(2)} = -5/3h^2$, $w_3^{(2)} = 1/2h^2$, $w_4^{(2)} = 1/3h^2$,
$w_5^{(2)} = -1/12h^2$.
$x=a_3$: $w_1^{(1)} = -w_5^{(1)} = 1/12h$, $-w_2^{(1)} = w_4^{(1)} = 2/3h$, $w_3^{(1)} = 0$,
$w_1^{(2)} = w_5^{(2)} = -1/12h^2$, $w_2^{(2)} = w_4^{(2)} = 4/3h^2$, $w_3^{(2)} = -5/2h^2$.
The $w_j^{(k)}$ for $x=a_4$ and $x=a_5$ are those for $x=a_2$ and $x=a_1$,
respectively, in reverse order.
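Weights of this kind can be regenerated by requiring the formula to be exact for 1, x, ..., x^(n-1) and solving the resulting linear system. The sketch below does this in exact rational arithmetic; the function name and the 1-based node numbering a_1, ..., a_n with equal spacing h are assumptions made for illustration.

```python
from fractions import Fraction
from math import factorial

def derivative_weights(n, at, k):
    """Weights w_1..w_n, in units of 1/h**k, with sum_j w_j f(a_j) ~ h**k * f^(k)(a_at).

    Nodes are a_j = a_1 + (j-1)h; the rule is exact for polynomials of degree < n.
    """
    offsets = [Fraction(j - at) for j in range(1, n + 1)]    # (a_j - a_at)/h
    # Row p of the augmented system: sum_j w_j * offset_j**p = k! if p == k else 0
    rows = [[o ** p for o in offsets] + [Fraction(factorial(k) if p == k else 0)]
            for p in range(n)]
    # Gauss-Jordan elimination with exact fractions
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [v / rows[col][col] for v in rows[col]]
        for r in range(n):
            if r != col:
                factor = rows[r][col]
                rows[r] = [vr - factor * vc for vr, vc in zip(rows[r], rows[col])]
    return [rows[j][n] for j in range(n)]

three_point_first = derivative_weights(3, 1, 1)    # first derivative at a_1, n = 3
five_point_second = derivative_weights(5, 3, 2)    # second derivative at a_3, n = 5
```

Dividing the returned fractions by h^k gives the weights in the form listed above, for example -3/2h, 2/h, -1/2h for the three-point first derivative at a_1.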
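2. (a) $f'(a_0) = (1/h)(\Delta f_0 - \tfrac{1}{2}\Delta^2 f_0 + \tfrac{1}{3}\Delta^3 f_0 - \tfrac{1}{4}\Delta^4 f_0 + \tfrac{1}{5}\Delta^5 f_0)$;
$f^{iv}(a_0) = (1/h^4)(\Delta^4 f_0 - 2\Delta^5 f_0)$.
(b) $f'(a_0) = (1/h)(\nabla f_0 + \tfrac{1}{2}\nabla^2 f_0 + \tfrac{1}{3}\nabla^3 f_0 + \tfrac{1}{4}\nabla^4 f_0 + \tfrac{1}{5}\nabla^5 f_0)$;
$f^{iv}(a_0) = (1/h^4)(\nabla^4 f_0 + 2\nabla^5 f_0)$.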

4, Get two linear equations for a,b; result y''+y=x; y=sin xty. 
5. h=1.0: -100; h=.1: -1.0000; h=.01: -.9900; true value -.9901.
Y _ _ ml = 
6. Set a; h; , x=0, Yi jiet,...,itm'*) = Ta , M=0,1,...,Nn. 
. m . . 
1 1 i-1 
7. (a) Poi (x) = J DS om 


(c) 


(d) 


j=1 
1,.Ypi Y\\2 
EB. (h Pi-7h )) #0 
6 


Ht 


i,,-6 _ 
m os then Eo ¢) = U 


i -6 i_,- 
= E (nh). Let US = hy 




11. 
12. 


13. 


14. 


15. 
16. 


17. 
18. 


rn ~ and U H T 
(he Dian) 71 
(a) $f'(x) = [f(x+h)-f(x)]/h - \sum_{i=2}^{\infty} f^{(i)}(x)\,h^{i-1}/i!$
(b)    h        T_0        T_1        T_2
     .0128    .49944
                         .50001
     .0064    .49973                .50000
                         .50000
     .0032    .49987

     .0256    .49883
                         .50008
     .0192    .49914                .50000
                         .50004
     .0128    .49944
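The extrapolated columns of the table in part (b) can be approximately reproduced from the first column alone. The sketch below assumes, in line with the expansion in part (a), that the error of the difference quotient is a power series in h starting with the first power; the function name is illustrative, and small last-digit discrepancies are expected because the tabulated inputs are already rounded.

```python
def richardson_table(hs, t0):
    """Extrapolate to h = 0 assuming an error expansion in powers h, h**2, ...

    columns[m][i] combines the estimates belonging to hs[i], ..., hs[i+m].
    """
    columns = [list(t0)]
    for m in range(1, len(hs)):
        prev = columns[-1]
        col = [(hs[i] * prev[i + 1] - hs[i + m] * prev[i]) / (hs[i] - hs[i + m])
               for i in range(len(prev) - 1)]
        columns.append(col)
    return columns

first = richardson_table([0.0128, 0.0064, 0.0032], [0.49944, 0.49973, 0.49987])
second = richardson_table([0.0256, 0.0192, 0.0128], [0.49883, 0.49914, 0.49944])
```

The update formula is polynomial extrapolation to h = 0, so it also covers the second table, where the step sizes are not successive halvings.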
First derivative: .6: 1.79576; .7: 1.73188. Second derivative: .6: 
~.6438; .7: -.7167. 
(a) Factor out H5a,. (b) Linear system has no solution; implicit
assumption of part a is that solution of linear system exists.
Argument with weight function is precisely analogous.


Calculated value 


n= 2 n= 3 n= 4 n= 5 True value 
(a) 1.263 3.975 2.047 3.089 2.6516353 
(b) .0128 .0116 .0102 .0099 .0099010304
(c) -0927 ~ 3019 221171 ~ 1432 1/6 
(d) .9122 1.0898 1.0823 1.0827 1.08294
(e) 2.052 2.144 2.199 2.235 2.40394 


(a) Perform successive integration by parts on the orthogonality 
integral. 
(b) Any other set of boundary conditions on $U_r(x)$ satisfying part
a give a $U_r(x)$ differing from that given by conditions of part b
by a polynomial of degree r-1, but this leaves $\phi_r(x)$ unchanged.
(a) In orthogonality integral write 9? as ¢. [AL x “+4 a7 (x) 13 


(b) integrate by parts. 
Proof is precisely similar to that of Theorem 4.1. 


(a) Show that x can be expressed as a linear combination of 
(x), k<j. (b) Multiply both sides by $5 00 i < r-1, and 
integrate. (c) Consider coefficients of x* in recurrence 
relation. (e) Use induction; if A.>0 for all r then show at zero 
of ¢, that 5 and $6 have opposite signs; from this get result for 


n=1; similarly, use inductive hypothesis to get general result. 
(b) Use telescoping of series on right-hand side. 


(a) ¢, (x)= Ua / (rok) 11 (17x ) (ak /ax OEP (a) }. 
(b) Yr “2 (r!) 2k!) “/{(areket) ((r4k) 1) 3} (use gamma- function 
integral) ; A= (-1)7 (2r+k) tk!/[ (r+k) 1)? 
19.


20. 
21. 
22. 


23. 


24. 
25. 


26. 


28. 
29. 


30. 


31. 


(c) Hy=(Qntk+2)ntkt/{{ (n+k+1) 1179.4 (asdon (as). 

(b) For Yy successively integrate by parts; (c) show a= (2r+1)/ 
(r+1), b=0, beogaar/(r+1). 

(b) For Yr successively integrate by parts. 

(b) For Y, use result of Prob. 12a of Chap. 1. 

(b) Legendre: 2/(nP,_ 4 (as) Plas) 1; Laguerre: ~£(n-1) 117/0. (a5) 


. ‘tae Nene t . 
Ly (a5): Hermite: 2 (n 1) t¥m/(H) (a5)H,_ (a5). (c) Legendre: 
2 2 2,. 2 2. 
2(1 a5) /{ (n+1) [Pig (85)] }; Laguerre: (n!) a;/{L 4 (a5)) ; 
wo n+1 2 
Hermite: 2 nivn/(H, 44 (a;)) . 
Calculated value
       n = 2        n = 3        n = 4        n = 5         True value
(a)  .00990057    .00990098    .00990098    .00990096     1/101 = .00990099
(c)  .7979        .7815        .7819        .7840         $\pi/4$ = .785398
(d)  .4178        .2412        .3710        .2784         1/3
(f)  1.3475       1.3820       1.38033      1.3803908     $\sqrt{\pi}\,e^{-1/4}$ = 1.3803884
(b) Bound: e949 

(e) Bound: e 17/3 


(b) For A, use induction. 
(a) Show coefficient of x” in J (e711) is 2/(n+2) times that in 
P +1 (*) by using Probs. 19 and 24. 

n+1 


(b) Express (1-x2)p (x) = J A.P. (x); express (1-x7)g (x;1,1) = 


2 ; ; 
2(xP 44 (x) +P yo (x) 1+ (1-x )p_1 (x) # show A5=0, j=0,...,n-1 using 


orthogonality relationships for Jat Pati! and P for j=n, n+1 


n+2’ 
~y»2 = - = ts ' 
show f (1-x )I Py dx=2h (xP, P +2) 0x: (c) H, 2/{(n+2)Po 14 (a;)Po yo 


2n+3 


(aj)1; E=n![ (n#1) $] 2 (n¢2) 127"*3¢ (2n43) ( (2n+2) 1} 2 (2n) 1}; abscissas 


t 
are zeros of Pity bX) > 


(d) Since a polynomial of degree r in $\cos\theta$ is a linear combination
of $\cos k\theta$, k=0,...,r; (e) Use $\cos k\theta = 2\cos(k-1)\theta\,\cos\theta - \cos(k-2)\theta$.
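The recurrence in part (e) can be turned directly into a way of writing $\cos k\theta$ as a polynomial in $\cos\theta$. The following is a minimal sketch under that reading; the helper name is hypothetical and coefficients are stored lowest power first.

```python
import math

def chebyshev_coeffs(r):
    """Coefficients (lowest power first) of cos(r*theta) as a polynomial in x = cos(theta)."""
    prev, cur = [1], [0, 1]              # cos(0*theta) = 1, cos(1*theta) = x
    if r == 0:
        return prev
    for _ in range(2, r + 1):
        nxt = [0] + [2 * c for c in cur]  # 2*x*cur, from 2 cos((k-1)theta) cos(theta)
        for i, c in enumerate(prev):      # subtract cos((k-2)theta)
            nxt[i] -= c
        prev, cur = cur, nxt
    return cur

def evaluate(coeffs, x):
    """Evaluate a polynomial given lowest-power-first coefficients."""
    return sum(c * x ** i for i, c in enumerate(coeffs))
```

For example, `chebyshev_coeffs(3)` gives the coefficients of $4x^3-3x$.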


By change of variable $\theta=\cos^{-1}x$.
Calculated value 


n= 2 n = 3 n= 4 n= 5 True value 
(a) 2.3888 2.4041 2.40394 2.40394 2.40394 
(b) 1.8138 1.3711 1.6274 1.5028 1.570796 
=m = ° = 7 7T 
(a) Show that S (x) =T 1 4 (x). (b) Yn 1/2; S47 (85) cos jT™; 


Si (aj)=(n+1)/sin® (5m/(n+1)]. 


(b) yp=2/(4nt1); AL=(Gn) 1/(229[ (2n) 117}; 
32.


33. 


34. 


35. 


36. 


37. 


38. 


39. 


40. 


41, 


42. 


Pp (45) Py y (a5) = (4nt3)P5, (a5) /[2 (2nt2) }. 


(ce) y,=2/(4n+3) ; A = (une2)1{277*1[ (2ne1) 117}; 
' = ‘ 2 
Py (a5) Prgy (ay) (445) Pong (85) Ponge (as) [2 (2nt3) ay). 


(b) Vy7t/2i A =2°0, Pr (45) Pag y (85) =Tongg (25) Tonge (85)785- 


6 


(a) Calculated result: .531099; |E|<3.1x10 /. (b) Calculated 


result: 3.42164; [E|<6.7x107>. 


(a) y=cosh @. (b) th'(t)=(p+1) [£(t)-h(t))];h(0)=£(0); h* (0) =(p+1) 
£°(0)/(p+2); h''(0)=(pt1)£"' (0) /(p+3). y 

(a) Because of the singularity at x=y. (b) Let Fly) = So {f£ (x) dx/ 
(y-x) 174), approximate F(h), F(2h),...,F(1) using (4.8-18); then 
differentiate numerically. 


(c) gly) =2(d/dy) pTlely sin*6)sin 6 d@: no singularity in 
0 


integrand now. 


(a) In (4.9-11) 2 2n+1 


2n+1 and there are m 2nth 


derivatives. (b) Using Table 4.1: $\int_a^b f(x)\,dx \approx (h/6)\sum_{j=0}^{m/3-1}$
$\{5f[a+(3h/2)(1+.774597+2j)]+8f[a+(3h/2)(1+2j)]+5f[a+(3h/2)(1-.774597+2j)]\}$;
$h=(b-a)/m$; $E=81mh^7f^{(6)}(\eta)/224{,}000$.
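The composite three-point Gaussian formula of part (b) is short enough to sketch in code. This is an illustrative implementation, not taken from the text: the function name is hypothetical, and the abscissa .774597 is replaced by its exact value $\sqrt{0.6}$.

```python
import math

def composite_gauss3(f, a, b, m):
    """Composite three-point Gauss-Legendre rule on m/3 panels of width 3h, h = (b-a)/m."""
    if m % 3 != 0:
        raise ValueError("m must be a multiple of 3")
    h = (b - a) / m
    t = math.sqrt(0.6)                  # 0.774597 in the hint
    half = 3 * h / 2                    # half-width of one panel
    total = 0.0
    for j in range(m // 3):
        mid = a + half * (1 + 2 * j)    # centre of panel j
        total += 5 * f(mid + half * t) + 8 * f(mid) + 5 * f(mid - half * t)
    return (h / 6) * total
```

Because the three-point rule is exact for polynomials of degree five or less, integrating $x^5$ over [0,1] with m=3 should reproduce 1/6 to rounding error.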
Calculated value 


becomes m 


n= 1 n= 2 
m= 2 m= 4 m= 8 m= 2 m= 4 m= 8 True value 


(a,b) 1.6 2.4 2.6416 1.26 2.70 2.71 2.6516353 


(c) 2.027. 2.107 2.177 2.052 2.147 2.219 2.40394 


Abscissas are symmetrically placed in interval and if center of 
ata ata 
2 Ryntl ss vero, 


on , the integral of (x - 
2 2 


(a) Use fact that Pret 
interval. (b) Let y=x+th in Thay integral. (c) Integrand of q(x) 
alternates in sign on [ay,asiy], but because Iz5_,l217,| integral 


interval is x = 


(x) is odd with respect to center of 


remains of constant sign. 


a a 

n (n+1) a (nt1) n 
(a) f° ¢£ (E)D 44 (x) dxef (n) fo" Pryy Ox) ax. 

a 1 

n-1 n-1 

Consider quadrature formula j' f(x)dx = (2/n(n-1)) [£(-1)+£(1)] + 
n-2 -1 
} H5f (a5) which is exact for polynomials of degree 2n-3 or less; 
j=1 


get y' f(x) dx=0 since abscissas are zeros of Pn tx) ? therefore, 
-1 
tH.f(a,) in given quadrature formula is zero from which result 


follows. (b) (2-n)/n smallest abscissa after -1. (c) For n even 
set 2m-1=n+1 and consider inequality (2-n) /n<B,; for n odd set 
2m-1=n e 


(a) Use an argument based on symmetry; (b) n even: (1/n1)£'") (¢) 
fAn XP 41 (x) dx. 
a) 
43.


44, 


45, 


46, 


47. 


48, 


49, 


50. 
51. 


52. 


(a) Show resulting determinant when solved for $\int f(x)\,dx$ is exact
for polynomials of proper degree. (b) Cotes numbers are solutions
of (4.4-1) with j=1 replaced by j=0 and $a_j$'s set equal to equidistant
abscissas; AL is matrix of coefficients in this system of
linear equations. (c) Show that Sante 
(a) 8, = 73 | S, = 0-1/2 12 
0 1/2 1/2 
(e) If a is vector of a,'s then S50 is vector (Hy -H,/H,)- 
(a) In Lagrangian formula use abscissas $a_j=x+jh$, $j=\pm1,\ldots,\pm k$;


express Pop (*) in terms of 3; solve for nz (2K) (ey, (b) Subtract 


4h/45 times formula of part a with k=2. (c) Add h/140 times
formula of part a with k=3. (d) Subtract 9h/700 times formula of
part a with k=3.


(a) Is I, + my (I5-1,)/(mj-m,) . (b) m=4: 1.1; m=8: 1.0987251. 
(c) 1.098640. 
Derivation of extrapolation assumes constant derivative; when 


derivative is constant get monotonic convergence. 
(a) Derive (4.10-13) from (4.10-21). (b) Proceed just as in part 
a. (c) Assume true for Tn k and use induction. (d) Th , uses 

, f 
244 points on each of 2k subintervals but Newton-Cotes formulas 
with these numbers of points have different error terms. 


(a) From (4.10-23) and (4.10-24) show c_,, rd ACE 


m+ 1 
(4 Cm3 "Sm, 5-1! and use (4.10-27) to get same relation. 


(b) Show ) 2/4)-1) converges and thus deduce product converges; 


derive bound using m(14a,)ser"4. (c) Show Com??? then replace 


j by m+1-j in recurrence relation of part a and use induction. 

(a) Use (4.10-24) to express Tak in form of sum of ordinate values. 
(b) Consider any term in z? from expansion of numerator of 
(4.10-24); corresponding to this there is a sequence of terms in 

21 which are 1/4, 1/16, 1/64,...,etc., as great as z? term in 


magnitude; since 1<i get result. (c) Using parts a and b and 
yo 3 


£ 


fact that signs of c_|, alternate, get ds? 0! since longl<I/3lenol. 
get 1/3¢919 <4 5m< mo? mo is minimum for m=0 (and equals 4/9) and is 
maximum as m7, 


Ty 570541, 542) 72,57 O541,542, 543; CTS: 
T1 00 To 0 T3 0 Ty 0 Ts 0 True value 
(a) ~549 2.278 2.584 2.654 2.65186 2.6516353 
(b) -0022_. .0078 -00977 009895 -0098970 -00990103084 
6.0x10 0204 -1784 1683 - 16659 1/6 
1.333 1.077 1.084 1.0831 1.08296 1.08291 


(a) n=1: Ho@4/15, H,=2/5; n=2: Hy=4/105,H,=16/35, H,=6/35. 
(b) n=1: H =4/3, H,=2/3; n=2: Hy=4/5, H,=16/15, Ho=2/15. 


(c) Calculated values: Prob. 33, n=1: .4828, n=2: 5319; 
Example 4.6, n=1=2: 2.6666667. 




53. (a) Use integration by parts. (b) pe P,, (x) dx"H) ([p, (b) -p, (a) ) 7 
b = =- = b - - 
J xp, (x) dx=H) [bp (b)-ap,(a)]. (ce) Hy sa 1, (x) dx Hy{ {p, (b) / (b 
a,)py (a5) )-{p, (a) /(a-as) py (as) 1}. (ad) For k=0,1,...,n-1 use 
result of part c; for k=n,n+1 use result of part b; for k>n+1 write 
xKayk2 (ya) (x=b) + (atb) x& | aba? and use induction. (e) Same 
argument as Theorem 4.1; n=1: Hy) = + 1 /3, H, sh, ay =a=h 
(1 + /73)/2; in composite formulas interior end points of intervals 


drop out. 
55. 2.6518; .009870; .1666670; 1.08294; 2.40394. 


56. (a) g(0) = 0, g(1) = 1, g'(x) = (x). 


(b) 6 "®) ¢x) > 0 as x + 0,1 for all k. 
57. (b) Integrate by parts. 


58. (b) When x = 1/2, left-hand side of (4.13-2) becomes t/ (et /741) . 
(c) Bo, (1-X) Bo, (x) and Boygy (17x) - ~Bop yy (X) since Bonny = 0. 


(d) If Boney (*) has five zeros, Boy (x) +Bo) has four from (4.13-8), 
then from (4.13-7) Boy 1 (X) has at least three zeros in the 


interior of [0,1] which means five in [0,1]; but B, (x) has only 
three zeros. 


59. (a) By part d of previous problem can apply second law of mean to 
(4.13-15)3 (b) since Bont (*) is odd. 
60. (b) Boye (0) = (2k+2) (2k+1) BB). (c) Since Bo), (0) =0 from (4,13-7) 


sign of Bo, (x) is sign of Bo, (0). (da) Consider sum formula with 


m and m+1; because of part c first neglected term in m+1 case less 
than error in m case. 

61. (b) Show t/ (etl t-e7 t/2) 22 (t/2) (et 2-1) -t/ (e*-1) ; (c) use parts a 
and b together; (d) use fact that Bo, (x) has maximum value at 
x=1/2. 

Xotnj n-1 
62. Proceed just as in (4.13-17); I, f(y)dysh J £(xp+(j+1/2)h]) + 
0 


(h/24) [A£)-A£_y]+... j=0 
63. (a) Use (4.13-17) noting that Enno for polynomials of degree 2m 


or less and differences of order greater than 2m are zero. 
x +nj 
(b) f 9 ~ €(y)dy=h(1/2£ +. ..+1/2£,)+ (h/12) (uSE -vSE,) « 


64. (a) Successive approximations are 1/2, .583, .575, .5790, .5748,
.5823, and then results get worse because of poor asymptotic
convergence. (b) Using terms through By, get .5772156; good
asymptotic convergence.
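The sequence in part (a) matches what the Euler-Maclaurin expansion gives for Euler's constant, $\gamma \approx \sum_{k=1}^{n}1/k-\ln n-1/(2n)+\sum_k B_{2k}/(2k\,n^{2k})$, with n=1. The sketch below assumes that this is the intended computation (the problem statement is not reproduced here). The choice n=10 for the well-behaved case is an illustrative assumption, and the function name is hypothetical.

```python
from fractions import Fraction
from math import log

# Bernoulli numbers B_2 ... B_10.
BERNOULLI = {2: Fraction(1, 6), 4: Fraction(-1, 30), 6: Fraction(1, 42),
             8: Fraction(-1, 30), 10: Fraction(5, 66)}

def gamma_estimates(n):
    """Successive Euler-Maclaurin estimates of Euler's constant starting from H_n."""
    harmonic = sum(1.0 / k for k in range(1, n + 1))
    estimate = harmonic - log(n) - 1.0 / (2 * n)
    estimates = [estimate]
    for two_k in (2, 4, 6, 8, 10):
        estimate += float(BERNOULLI[two_k]) / (two_k * n ** two_k)
        estimates.append(estimate)
    return estimates

part_a = gamma_estimates(1)    # corrections oscillate and then grow
part_b = gamma_estimates(10)   # corrections shrink rapidly (n = 10 is an assumption)
```

With n=1 the correction terms start growing after a few steps, which is the poor asymptotic convergence the hint refers to; with a larger n the same terms shrink quickly.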

65. (a) 2.6515656; (b) 2.6598039; poor asymptotic convergence of 
Euler-Maclaurin formula causes corrections to give worse result 
than without. 


66. (a) Ex(b-a)h™t@p, ie (22) (£) /(2m+2)! where h = 
trapezoidal rule uses n_ subintervals. 
= . 24+2 = 

(b) JE|<E. | (b-a)h Bo 542Mo542/ (2mt2)!| , j=i,...m where 


je 62542) (x) | 


bra and the 
n 


= max 
a<x<b 
(c) If T is the period of the smooth periodic function f(x), then 


e62k-1) (a) = g 2E-1) (aur), Kat,... me 


(d) The error expansion starts with n2mt2 


(e) 1.00000, 1.16667, 1.15476, 1.1547005. 


M9542 
67.


68. 


70. 


71. 


(f) 1.03128, 1.19781, 1.10004, 1.19806. 


(a) Since x () is a polynomial of degree k in x, xXx 
polynomial of degree k-1. (b) Use (4.1321). 


(k) 


(a) B= ol 72, (by s = 5 = J (3x21) / fx (2-1) (x741) 77, 
3 


(c) use S= J (x-1) 203) 301/18; then S = ty - J (6x°+11x74 
1 1 

6x) /[x> (x+1) (x+2) (x+3)]; four terms of £1/x’ gives 1.0788; four 

terms of new series gives 1.0818; true value 1.0823. 

(a) Show by induction that Pet) Fea /2%, 


(b) Use induction on m. 
(a) Use induction on k. 
(c) S = .785398. 
CHAPTER 5


1. (a) If-construct as in (5.1-3); only if-show that if highest-order 
derivative must appear on some right-hand side, then can't construct 


as in (5.1-3). (b) y'=u,; Z'EVii UUs; u5=sin y-u,v,-xy7zu, ; 
vi=-(ysustve/*u,-eY7)/u,. 
2. (a) ¥Cxpth)-¥ (xp)-. by (x0) /mtah™? TyM*T (ey / (met) | 


(b) Compute $y^{(i)}$, i=2,3,4 from differential equation; computed
values: .2: 1.0204, .4: 1.0860; true values: .2: 1.020408,
.4: 1.086957.

3. Show can solve for Yn+1° 


4. (a) 306a=-413b+468, 34c=13b-20, 153d=-5b+9, 34e=-b+12, 51f=31b+36,
34g=37b-36; for b=2/3, a=17/27, c=-1/3, d=1/27, e=1/3, f=10/9,
g=-1/3. (b) 8a=-19b+27, 8c=11b-19, 24e=-b+9, f=b, 8g=19b-27,
12d=5b-9; for b=1, a=1, c=-1, e=1/3, f=1, g=-1, d=-1/3.


5. (a)  j   a_j    b       c       d       e
        0   1    55/24  -59/24   37/24   -3/8
        1   1     8/3    -5/3     4/3    -1/3
        2   1    21/8    -9/8    15/8    -3/8
        3   1     8/3    -4/3     8/3     0
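6. (a) $y_{n+1} = y_n + hy_n'$
$y_{n+1} = y_n + h(3/2\,y_n' - 1/2\,y_{n-1}')$
$y_{n+1} = y_n + h(23/12\,y_n' - 4/3\,y_{n-1}' + 5/12\,y_{n-2}')$
$y_{n+1} = y_n + h(55/24\,y_n' - 59/24\,y_{n-1}' + 37/24\,y_{n-2}' - 3/8\,y_{n-3}')$
(b) $y_{n+1} = y_n + hy_{n+1}'$
$y_{n+1} = y_n + (h/2)(y_{n+1}' + y_n')$
$y_{n+1} = y_n + h(5/12\,y_{n+1}' + 2/3\,y_n' - 1/12\,y_{n-1}')$
$y_{n+1} = y_n + h(3/8\,y_{n+1}' + 19/24\,y_n' - 5/24\,y_{n-1}' + 1/24\,y_{n-2}')$
(c) $y_{n+1} = y_n + hy_{n+1}'$
$y_{n+1} = \frac{4}{3}y_n - \frac{1}{3}y_{n-1} + \frac{2}{3}hy_{n+1}'$
$y_{n+1} = \frac{18}{11}y_n - \frac{9}{11}y_{n-1} + \frac{2}{11}y_{n-2} + \frac{6}{11}hy_{n+1}'$
$y_{n+1} = \frac{48}{25}y_n - \frac{36}{25}y_{n-1} + \frac{16}{25}y_{n-2} - \frac{3}{25}y_{n-3} + \frac{12}{25}hy_{n+1}'$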
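The derivative weights in parts (a) and (b), and the rows of the table in Prob. 5, can be reproduced by integrating the Lagrange interpolating polynomial of $y'$ exactly. The sketch below is one way to do this with rational arithmetic. It assumes nodes measured in units of h with $x_n$ at 0, and the helper names are hypothetical.

```python
from fractions import Fraction

def poly_mul(p, q):
    """Multiply two polynomials given as lowest-power-first coefficient lists."""
    r = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            r[i + j] += a * b
    return r

def integrate(p, lo, hi):
    """Exact integral of polynomial p over [lo, hi]."""
    def antiderivative(x):
        return sum(c * Fraction(x) ** (k + 1) / (k + 1) for k, c in enumerate(p))
    return antiderivative(hi) - antiderivative(lo)

def adams_weights(nodes, lo=0, hi=1):
    """Weights of y' at the integer nodes (in units of h) for integration over [lo, hi]."""
    weights = []
    for j, sj in enumerate(nodes):
        basis = [Fraction(1)]
        for i, si in enumerate(nodes):
            if i != j:
                # Lagrange factor (s - si) / (sj - si)
                basis = poly_mul(basis, [Fraction(-si, sj - si), Fraction(1, sj - si)])
        weights.append(integrate(basis, lo, hi))
    return weights

explicit4 = adams_weights([0, -1, -2, -3])          # last formula of part (a)
implicit4 = adams_weights([1, 0, -1, -2])           # last formula of part (b)
prob5_j3 = adams_weights([0, -1, -2, -3], lo=-3)    # Prob. 5 table, row j = 3
```

Changing `lo` to -j reproduces the other rows of the Prob. 5 table, since those formulas integrate from $x_{n-j}$ to $x_{n+1}$.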


7. If $\int G(s)\,ds\neq0$ let y(s) be such that y(s)>0 on [a,b] and $\int y(s)G(s)\,ds=0$;
if $\int G(s)\,ds=0$ then any y(s) such that $\int y(s)G(s)\,ds\neq0$ will do.
8. (a) n=1: G(s)*(b-s) (a-s)/2 is of constant sign on [(a,b};: SG(s)ds= 


-n?/12; ne2: on [0,h], G(s)=s8"/s-hs°/3<0, on {(h,2h], G(s)=(2h-s) > 
(h/6-8/4)<0; G(s)ds=-h°/90. (b) a < 1/2; Em[(1-3a2)/3]£'' (E); 


at a=1/ /3 formula is Gaussian two-point formula and order of 
accuracy is 3. 
9. (a) b<9/5,G(s8)<0;b>12, G(s)>0; G(s) could change sign twice in 


interval (x, Xs 44) (b) Use y=x°; E= (-216+86b) hoy”= (n) /17%6! 
v 
10. (a) 4=3, yes, E=14h°Y’(n)/45; j=2, yes, E=141h°-Y’(n)/240; j=1, 
yes, E=16h°y” (n)/45; j=0, yes, E=251h°-Y”(n)/720. (b) a.<1, G(s)<0; 


1 
a,>9, G(s)>0; E=(-9+5a,)h-¥” (n) /360. 
11.


12. 


13. 


14, 
15. 


16. 


17. 


18. 


19. 


20. 


(a) Stability equations are p 


rpt 1) 


p+1 - . Ppri_ p- 
(1+hK, , b_,) + 1r5 (hKy5b_4) = dt (aa hK,,b;)r, *-nK, 5b, rb 
p i 
pr = - p- 
and rj) (hK, +r” "(1+hK, gb_ 1) ty! hK,,b,ry °+(a,-hK5b,) 


re * ; in general these two coupled equations are very difficult


to solve. (b) Set h=0; get uncoupled equations of form (5.4-14). 
(b) As h*0 first column of numerator determinant becomes all i's 
and ith column becomes all Yo ' 8: therefore, determinant is zero. 


Pp 
(a) For equation y'=0,y(0)=0,y, = ) hry is solution of (5.2-5) 
i=0 
as in proof of Theorem 5.1; if complex roots are $Re^{\pm i\theta}$, $y_k$ has
tok 2 4.12,2k _ 
term Ww, =hR cos k@; then (weaw +1" KH ,)/sin 6=h°R': for conver 


2 _ k 
gence Wy-Wy 4 must~*0; therefore, R70. (b) Let WwW, shkr 
and proceed as in part a. (c) For all types of roots show Yn79 
as n+, 

= 8 
(b) Yn+4 Ynthyy: 
(a) By induction; take absolute value of both sides of (5.4-25). 
(b) Use fact that La,=1. (c) Use second equation of (5.2-6). 


(a) e,=cr5 + E/Kho; if c = \ - E/Kho then e, = A and e, > X, 


0 
i=1,...,p. 
(a) If $z=re^{i\theta}$, |r| < 1, show real part of w is negative.
(c) Since n=p + 1, c, = -a, + hKb. ,. (d) Need d, > 0, 

i i-1 i-1 0 
d, > 0, d, d., - dod, > 0, d, > 0; for hK = 0 get -.6 ¢<b< lt. 
(e) b=0, 0 < hK < 8/3; b = 1/2, 0 < hK < 24/17; this result
determines absolute stability; the range of positive hK for which
a method is relatively stable will be smaller than the above
ranges.


n-1 “F 1 2 oe 2 
(a) Consider Lot Lo CeXpar) 7 Y ocn-s*pts? }:; by 
manipulation get quadratic form with coefficients Als! by comparing 
polynomial with coefficients Cc. with (5.4-14) get Cea, 1° 


2 = - - e 
(b)  App=Azou(63+2b-b°)/64; = Any =Ay "Aa y=Ay = (-94+10b-b") /8; 


2 
A,4=(11-10b+5b )/4.  (c) ea y*Khb, 1° 


of hK than in previous problem, but paper by Emanuel indicates how 


these higher powers can be avoided. 
(a) With p=3 and using method of Prob. 16 with hK=A get d= 10/9, 


d= (20+32A)/9, do= (644241) /9, d= (164-241) /9, d= (164- 242)/9, 
dy=(16-30A)/9; from this get 0 < A < 8/15. (b) Consider conver- 
gence only; get dy=2-2b,d,=(-3+11b)/4, d= (27-3b) /4, d,=0; find no 
b for which principal minors all have same sign. 

(a) Since $r_0=1-Kh+O(h^2)$. (b) Merits: guarantees no absolute
increase of error and it is comparatively easy to compute when a
method is stable; limitations: doesn't relate error to true
solution and ignores K > 0 case.


(a) h< 3. (b) They? (£)/s40=-5203e °/540. (c) b=0: h < 8/3; 
bel: h < 3; beO:T=4096e7°/1215, bet: T=243e°/90. (da) Show that 


(da) Get higher powers 




21. 


22. 


23. 


24. 


25. 
26. 


27. 


28. 


29. 


30. 


31. 


the kth derivative of y, where y'=f(x,y), contains a factor $\partial f/\partial y$
raised to the k-1 power.


(a) Integrate from 0 to h. (b) p=0: $h^2y''(\eta)/2$; p=1: $5h^3y'''(\eta)/12$;
p=2: $3h^4y^{iv}(\eta)/8$.
(a) Integrate from -h to 0. (b) $y^{(p+2)}(\eta)\int_{-h}^{0}x(x+h)\cdots(x+ph)\,dx/(p+1)!$
(c) p=0: $y_{n+1}=y_n+hy_{n+1}'$, $T=-h^2y''(\eta)/2$; p=1: $y_{n+1}=y_n+(h/2)(y_{n+1}'+y_n')$,
$T=-h^3y'''(\eta)/12$; p=2: $y_{n+1}=y_n+(h/12)(5y_{n+1}'+8y_n'-y_{n-1}')$,
$T=-h^4y^{iv}(\eta)/24$.


(a) Integrate from -h to h. (b) $\int_{-h}^{h}x(x+h)\cdots(x+ph)\,y^{(p+2)}(\eta)\,dx/(p+1)!$;
integrand changes sign. (c) p=0 and 1: $y_{n+1}=y_{n-1}+2hy_n'$,
from Table 4.7 with n=2, $T=h^3y'''(\eta)/3$; p=2: $y_{n+1}=y_{n-1}+(h/3)(7y_n'-2y_{n-1}'+y_{n-2}')$.
(a) In (3.7-14) set f(x)=y(x), X=X 41! As=Xy 544! since n=p+1 and 
r=q-s+1 result follows. (b) p=#2, q=1, s=0. (c) Yat oY nt n-4 


-17y,_9th(-18y"_,-6y!_»): T#3h°¥¥(n)/10. 


Modifier equation is yO) ay O)49/10ly, -¥ 


(a) Show result is exact for y=x>, (b) If predictor-corrector 


a» ¢ n+1 
pair is of order n, show exact for x ° 


(a) ap=t-aj-ay, b_y=(9-a,)/24, by= (19+13a,+8a5)/24, 
b,=(-5+13a,+32a,)/24, bo= (1-a,+8a,)/24. (b) Let n=0, -1, -2 and 
solve for C33 then by factoring out (r-1) from (5.4-14) show 

p,teo=-(a,ta.) and p4p."a,. Need |1+a,+2a,/>1; in Table 5.2 a,=0 


and 9/17 satisfy condition. (d) (attateas) 1/2, 


get T=(-19+11a,-8a,)h¥’ (n) /720. 
(a) dy=2(1-a) -hK (1+a) /3, d,=2(at1)+nkK(1-a), d,=hK(1+a). 
(b) d,>0 implies hK < (6-6a)/(1+a), d,>0 and d.>0 do not restrict 


1 
o = VI3 - 3, a, = 4 - VI3, b_, = 


(fT3 + 1)/12, by = (10 - 2 ¥73)/3, b, = (19 - 5 YT3) 12. 

(a) Predictor error: $14h^5y^{v}(\eta)/45$; corrector error: $-h^5y^{v}(\eta)/90$.
. Yn+1 Yn+1 Yn Yn ° 

(b) For (5.5=-1) integrate Lagrangian formula based on x, and Xt 

from 0 to h; for (5.5-2) integrate formula based on x, and X-4 


from -2h to h. (c) No, only if method has order p+1 can 
Lagrangian formula be used. 


(0) 
n J 


(e) Use y=x° to 


0 
Kh for K > 0. (c) a=4- /7T3; a 


h 
(b) Show (d7*I/axt*I) [x™ (n-x)* . (r+) {/(41(r-3) 8]; get 
r-1 : : : . 
: (1) FC (reg) 1/7051 (2-5) 1D E17) Fy OF) (ny (1) Fy FF) (OV + 
j= 
Q y'(x)dx; set ker-j. (¢c) Use expression in part a, apply the 


second law of the mean and then integrate by parts r times. 
32.


33. 


34. 


35. 


36. 
37. 


38. 


39. 
40.


41. 


(a) Use Newton's backward formula with m=1. (b) If r differences 


are retained nrtty(r+1) (py, (c) Method requires more past values 
than ordinary predictor to get same accuracy; but it requires one 
less evaluation of f(x,y); further disadvantage is that no error 
estimate available. 

(b) $y_0'=(-11y_0+18y_1-9y_2+2y_3)/6h$; $y_1'=(-2y_0-3y_1+6y_2-y_3)/6h$;
$y_2'=(y_0-6y_1+3y_2+2y_3)/6h$; $y_3'=(-2y_0+9y_1-18y_2+11y_3)/6h$; errors:
$-h^3y^{iv}/4$, $h^3y^{iv}/12$, $-h^3y^{iv}/12$, $h^3y^{iv}/4$, respectively. (e) Not a special case
of general operator since $(x_0, y_0)$ is not a solution point.
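The four-point differentiation formulas of part (b) should be exact for cubics, which gives a quick check of the coefficients. The sketch below does this with rational arithmetic; the names `WEIGHTS` and `derivatives` are illustrative, not from the text.

```python
from fractions import Fraction

# Row i holds the weights of y_0..y_3 in 6h * y'_i, as in the hint for part (b).
WEIGHTS = [
    [-11, 18, -9, 2],
    [-2, -3, 6, -1],
    [1, -6, 3, 2],
    [-2, 9, -18, 11],
]

def derivatives(y, h):
    """Approximate y' at the four equally spaced points from the values y_0..y_3."""
    return [sum(w * v for w, v in zip(row, y)) / (6 * h) for row in WEIGHTS]

# Check on the cubic y = x^3 - 2x, whose derivative is 3x^2 - 2.
h = Fraction(1, 10)
xs = [Fraction(1, 2) + i * h for i in range(4)]
approx = derivatives([x ** 3 - 2 * x for x in xs], h)
exact = [3 * x ** 2 - 2 for x in xs]
```

Since the error terms involve $y^{iv}$, `approx` and `exact` coincide for any cubic.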


(a) Method best when f(x,y) is polynomial in x and y. (b) Show 
A. is proper coefficient in Taylor series. 


(a) y(.01) = 1.010101, y(.02) = 1.020408, y(.03) = 1.030927;
true values: 1.01010101..., 1.02040816..., 1.03092781....
(b) y(.1) ≈ 1.111, y(.2) ≈ 1.248, y(.3) ≈ 1.417.

(b) Only for first equation is G(s) of constant sign. 

(a) At least five in order to retain fourth-order accuracy; 
y (x,-3/2h) , y (x ,-1/2h) . (b) Use Yn-2' Yn-a' Yn-6" etc.; 


y through Yn-6° 


n-1 
(a)        Calculated values                 True values
     x = .1    x = .2    x = .3      x = .1    x = .2    x = .3
A   .904837   .818734   .740838     .904837   .818731   .740818
B   .000050   .000800   .004050     .000050   .000789   .003931
C   .900625   .804667   .714625     .900623   .804631   .714430
D  1.020116  1.081866  1.189450    1.020134  1.082182  1.191380


(b) yorl, yy=1-x, yo=t-x+x/2, etc., getting at each stage term of 


expansion of e *. (c) .904837, .818733, .740837. 
(d)            (i)                               (ii)
     x = .1    x = .2    x = .3      x = .1    x = .2    x = .3
A   .905000   .819025   .741218     .904833   .818723   .740808
B   .000044   .000784   .003948     .000046   .000781   .003919
C   .900644   .804742   .714676     .900625   .804632   .714427
D  1.020030  1.081819  1.190529    1.020131  1.082174  1.191357

               (iii)
     x = .1    x = .2    x = .3
A   .904838   .818731   .740818
B   .000050   .000790   .003931
C   .900624   .804631   .714430
D  1.020134  1.082183  1.191378
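Part (b) of the problem above builds the series for $e^{-x}$ one term at a time. A minimal sketch of that process follows, assuming that equation A is y'=-y, y(0)=1 (consistent with the tabulated true values) and that the iteration is the successive-approximation step $y_{k+1}(x)=1-\int_0^x y_k(t)\,dt$. Both are inferences from the hint rather than statements in it.

```python
from fractions import Fraction

def picard_step(coeffs):
    """One step y_{k+1}(x) = 1 - integral_0^x y_k(t) dt; coefficients lowest power first."""
    integral = [Fraction(0)] + [Fraction(c) / (i + 1) for i, c in enumerate(coeffs)]
    result = [-c for c in integral]
    result[0] += 1
    return result

# Start from y_0 = 1 and apply three steps.
iterates = [[Fraction(1)]]
for _ in range(3):
    iterates.append(picard_step(iterates[-1]))
```

Each iterate adds the next term of the Taylor expansion, so `iterates[2]` holds the coefficients of $1-x+x^2/2$.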
(a) Estimated errors: Milne: x -8*107°, Hamming: ~18x107°,


(a) Milne's method satisfies criterion of Prob. 27c, Hamming's 
does not but the differences between the two are not great; (aé + 


a‘ + as) \/? is 1 for Milne, approximately 9/8 for Hamming; not 
surprising differences don't show up in Table 5.6. (b) Show 
truncation error very small. (c) Increasing interval will 
decrease total error as roundoff dominates truncation and it will 
speed up computation. 

Results at x=1.0 using 10-digit floating-point arithmetic 

(d=10, m=8):; 




42. 
43. 


aq, 
45. 


“7. 


48. 


49. 


50. 


51. 


           A           B           C           D
(a)    .36769539   .36883001   .29081685   5.9839745
(b)    .35623574   .32688895   .27727846   4.6883179
(c)    .36778742   .36821519   .29131811   5.6775766
(d)    .36787955   .36787700   .29099004   5.8555604
(e)    .36787888   .36785361   .29100297   5.8636981
True
value  .36787944   .36787904   .29098835   5.8510379
(b) Consider f(x,y) such that £70 and Df =O. 


(a) From (5.8-36), Wo¥0; from (5,8-40), w W570. (b) By induction; 


suppose ky has a term e(£)) 0 D5 f;: then from Bet, w(K hot.) (e73y) 


- M-1 
term, Kye has term c(t.) Dof. 


(b) From (5.8-36) use aoW5 = 1/2. (c) Use (5.8-40). 

(b) a5=0 means fourth equation of (5.8-40) cannot be satisfied. 

(c) a5=a,72/3, Ww y=1/4, Wo=3/4-w,, B,,=2/3-1/4w3 B45=1/4w,: a,=0, 
a,=2/3, wo=3/4, w,=1/4-w,, B44=-B371/4w3. 

(a) To show a, proceed as follows; number equations in (5.8-26) 
from (1) to (8) and perform following manipulations fa, (4); means 
a, times Eq. (4)): (9): a, (4)-(7); (10): (6)-a, (4); (11): eliminate 
Wy between (8) and (10); (12): eliminate B35 between (9) and (11); 
(13): A504, (2)-(ajtay) (3)+(4); then from (12) and (13) get ay=l. 

(b) a5=0, (8) cannot be satisfied; for a,=1 (4) and (7) are 
incompatible since ay,=1. (c) aj=a,=1/2, W,=W,=1/6, W,=2/3-W,, 
B,5=1/6w3, By o=1-3w3, 8, 3=3W3: ao=1, a,=1/2, w,=1/6, w3=2/3, 
Wi=1/6-W,, B45=1/8, Byot-1/12wy, By a= 1/3W,: a,=0, a,=1/2, Wo=2/3, 
Wywl/6, w,=1/6-w,, B4o=1/12w,, Byo™3/2, By =6w,. 

(a) ao=1/2 implies B35 is infinite unless a,=1/2; 6a,a,-4 (a,ta,) + 
3=0 implies By and By3 are infinite. (b) When B35 is infinite 
W4=0; when Byo and By, are infinite w,=0. (c) If w, term is 
omitted last equation of (5.8-26) will never be satisfied; if W3 
term omitted arrive at similar contradiction. 


(a) x? : R - K exact, (5.5-11): -n?/6; y: R- K: hy, /6+0(h'), 


(5.5-11): -h°y(n)/12. (b) x": R- K: @ 10°2h?; (5.5-33): 
-.6h°; y: R- K: -hy./120 + 0(h°), (5.5-33): -h°¥(n)/40.  (c) In 
the Runge-Kutta cases at most one term in (5.8-30) is nonzero; y is 


probably a better test than x™ since in practice fy will generally 


not be zero. 
(a) Each (k, -h, f n) starts with an h? term; therefore, cross 


products start with nh" ; together with multiplicative factor h have 
at least h? - (b) Since no crossproducts, equations for Vir Y ij 


same as those for Wis Bi 5° (c) Yes; result of part a holds in 
general. 


i 
Use y'=z,z'=F(x,y,z); if F independent of y', k,=h, (z_ + J 
i-1 J 

Byymj)- my=hlF (x taj;h yy, + 7 Bs 4ky)- 




52. 


53. 


54. 


55. 


56. 


57. 


58. 


Solution of D is $y=(2/\cos^2x)-1$; as x increases derivatives of y
grow very rapidly.

(a) Show by induction that for p-stage method, r(h) is a polynomial 
of degree p. Then use definition of order applied to (5.4-1). 
(b) -1/1440 

(c) r(0)=1, r'(h) < 0, h>0

(a) r(0)=1, r'(0)=-1. Determine value of r(h*) where r'(h*)=0.


. . _ _ 7 _,, (0) 
(b) Use Hermite formula with ay=X 1 ag=x (c) Tes [Yaa Yn+1! 
/30. 


(a) Write (5.9-8) as y,,=y +hEbjy'_; 


n-1' 


let K, M be 


2 1 
ith IB yy 


-i’ 
characteristic values of fe Exy! then get yPt! 
P e 
h* (K74MB_,)] = xP + J [-hkb,+h?(K74mM)B,JxP"*. (b) only root of 

0 


[1+hKb_, - 


convergence equation is r=1. (c) Equation of part a has a single 
root only for all hk. 

Results at x=1.0: Hamming: .36787901; (5.9-5): .36788116; true 
value: .36787944; Hamming's method better mainly because starting 
values have error of opposite sign to that produced by this method 
but (5.9-5) produces same sign error as starting values; therefore, 


at x=.6 errors are 2x10 ° and 63x10, respectively. 
Results at x=1.0: A: .36787963, B: .36788031, C: .29098801,
D: 5.8497670; could have started at x=.2.


h To T; Ts 
1/2 23322 52741 
3333 14614 
1/4 -3330 49146 3333 33122 
-3333 31065 

1/6 3332 05768 
(a) 
h T9 Ty Ts T, Py 
1/2 - 375000 

- 369792 
1/4 - 371094 - 367914 

- 368031 - 367880 
1/8 - 368797 - 367880 

- 367890 
1/16 - 368116 
1/2 -500000 

- 361328 
1/4 - 395996 - 368114 

- 367690 - 367879 
1/8 374767 - 368883 

- 367871 
1/16 - 369595 
1/2 - 308594 

- 295196 
1/4 298545 - 290805 

- 291080 - 290987 
1/8 - 292946 -290984 

- 290990 
1/16 291479 
1/2 5.351644 

5.832252 
1/4 5.712100 5.845939 
5.845084 5.850817
1/8 5.811838 5.850748 5.851033 
5.850387 5.851033 
1/16 5.840750 5.851028 
5.850998 
1/32 5.848429 
(b) 
A 1/2 - 375000 
- 369792 
1/4 - 371094 - 367940 
- 368146 - 367880 
1/6 - 369456 . 367884 
- 367950 
1/8 - 368797 
B 1/2 -500000 
- 361328 
1/4 - 395996 - 368297 
- 367523 . 367879 
1/6 - 380178 - 367905 
- 367810 
1/8 374767 
Cc 1/2 - 308594 
295196 
1/4 - 298545 - 290673 
2291175 - 290976 
1/6 - 294451 - 290957 - 290989 
- 291012 .290989 
1/8 - 292946 -290985 
- 290992 
1/2 - 291860 
D 1/2 5.351644 
5.832252 
1/4 5.712100 5.842860 
5.841682 5.849897 
1/6 5.784090 5.849458 5.850920 
5.847514 5.850892 
1/8 5.811838 5.850732 
5.849928 
1/12 5.832999 
59.  x       A              B              C
     1   .367880        .271829 (1)    .785397
     2   .135336        .738909 (1)    .110715 (1)
     3   .497874 (-1)   .200857 (2)    .124905 (1)
     4   .183158 (-1)   .544987 (2)    .132582 (1)
     5   .673803 (-2)   .148415 (3)    .137340 (1)
     6   .247879 (-2)   .403434 (3)    .140565 (1)
     7   .911897 (-3)   .109665 (4)    .142890 (1)
     8   .335469 (-3)   .298101 (4)    .144644 (1)
     9   .123412 (-3)   .810325 (4)    .146014 (1)
    10   .454010 (-4)   .220270 (5)    .147113 (1)
Tabulated values are those of Tyr using h,=1/2", i=1,...,4 . 
60. The calculations blow up for reasonable values of h. 
61. For equation (5.4-1), Ynt1/Yn = ue . If Re(-k) < 0, rel < 1. 
CHAPTER 6


1. (a) Coefficients for m= 5: 
9.0000 0 3.7500 0 2.7656 0 
0 3.7500 0 2.7656 0 2.3876 
3.7500 0 2.7656 0 2.3876 0 
0 2.7656 0 2.3876 0 2.2080 
2.7656 0 2.3876 0 2.2080 0 
0 2.3876 0 2.2080 0 2.1145 


(b) Coefficients for m = 5: 


9.0000 0 1.1250 0 1.4121 0 

0 3.7500 0 1.2890 0 1.6351 
1.1250 0 2.8476 0 1.5159 0 

0 1.2890 0 2.6184 0 1.7516 
1.4121 0 1.5159 0 1.2890 0 

0 1.6351 0 1.7516 0 2.5937 


2. (a) Show that the ith component of $f-Qa^{(m)}$ is given by (6.2-2).
(b) Form is positive definite since $Q^TQ$ is positive definite.
(c) Consider $(f-Qa^{(m)})^T(f-Qa^{(m)})$. (d) Only first term in part c
affected by choice of $a^{(m)}$.

3. (a) G5,=0 if j+k is odd. (c) Yes; can decouple if @, (x) are, 
respectively, odd and even, 


4. (a) Use a.=i, b,=k-1. (b) Use a,=1,2,...,i-1, i+1,...,n; 
i k iis i+}. a3 
b.=0,..., j-2, j,-.+, nol. (a) hej=(-1)*7IJarI za, 
5 4 n n‘on 
(f) Upper triangle of He is 
25 -300 1050 -1400 630 
4800 -18900 26880 -12600 
79380 -117600 56700 
179200 -88200 
44100 
5. ay a, as a 
m= 3: 2.0001004 2.2501083 0313056 -0020859 
m= &; 2.0000374 2.2501083 -0318428 -0020859 
m= 5: 2.0000374 2.2507475 -0318428 -.0005568 
ay a. Det 
m = 3: 14.13 
m= 4; -.0005233 -1.269 
m= 5: -.0005233 -0020571 .0252 
6. (a) 
ao a, ao a, 
m= 3: 2.0105356 2.2513600 -0208705 -0008340 
m= 4; 2.0105472 2.2513600 -0209302 -0008340 
m= 5:3 2.0205472 2.2512948 -0209302 -0006904 
ay ae Det 
m2 3: 198.7 
m= 4; -.0001209 341.5 
m= 5:3 -.0001209 -0002625 420.1 
(b) 
99 a4 Bo 33 
m= 3: 2.0001004 2.2501090 -0313056 -0020850 
m= 4; 2.0000368 2.2501090 -0318488 -0020850 
m= 5: 2.0000368 2.2507514 .0318488 -.0005710


ay a. 
m= 3: 
ms q; -.0005291 
m= 53 -.0005291 0020674 


Difference between 5 and 6b accounted for by differing roundoff 
errors in solution of two systems; expect 6b to be better since 
system is much better conditioned; low-order coefficients are 
quite inaccurate. 


-1 
7. (a) Corresponding to (6.4-11) use Uy = Vy: us = V5 - ‘ di5¥5° 


0 
r=0 
(b) Use integration instead of summation in (6.4-12). 
8. (a) Consider zeros of odd multiplicity and use argument similar to 


that used to prove Theorem 4.1. (b) Show that if pi”) (x) = 

(x-x qd eee (xx, ) then all conditions are satisfied: show uniqueness 
of p'™) (x) by considering another such polynomial qi”) (x) and 
using orthogonality to show that p 6”) (xy -4'") (x) = 0. 


9. (a) Use (6.414). (b) Just differentiate (6.4-27). 
10. (a) by)=2.0131440, b,=2.2516466, b.=.01826140, b,=.0005471, 


2 
by=-.0000498, b.-™.0000538. 


5 
bal) a4 a a3 
m= 3: 2.0001006 2.2501095 .0313053 .0020843 
m= 4: 2.0000366 2.2501095 .0318505 .0020843 
m= 5: 2.0000366 2. 2507520 .0318505 ~.0005721 
ay a> 
m= 3: 
m= 4; -.0005310 
m= 5; -.0005310 -.0020677 
As expected, approximations are closer to those of Prob. 6 than 
Prob. 5. 
(D) ms 0 1 2 3 4 
6: 19.013 .00118 .176x107> .24Bx107° 227x107 © 
a! .2377. 00017 .29x107° 500x107! 84x10"? 


Therefore, M = 3. 
11. (c) In 11a let v; = u,_, and u, = v;. 


12. (a) Set wip, (iN) = Aju, (4,8). (b) Apply summation by parts 
and use A?qy_, (i) =0. (c) Use first equality to find zeros of U5. 


13. (a) Calculate (1 + x) Ptm 


i-N 


m4. (a) (2 -N- 199) © 58 ¢ N Wo gy ph De 1-H), 


j 

(b) Convert to factorial notation; (c) Use result that d , = 
a - 5)" 713) (a) Show 654 94) 2 (54k) 15 sets (e) Byy * 

_a\J (3) 
(-1) “31N Asn 
15. (a) Express (j + x) 9), n'*) ana G _ y) in terms of factorials; 


(c) Use (6.4-29) and show limiting recurrence is that for 
Legendre polynomials. 
16.


18. 


19. 


20. 


22. 


23. 


24, 


(a) = ; p5 (8, ,2L). (b) Use the 


y,,(s) = 
m s=-]L, 


yeas 


12a) and sum 
12c) 


orthogonality property of the p,(i,N), then use (Prob. 


by parts. (c) Show c., 5s f2D 7G) 2 (5), and then use (Prob. 


to derive the first equality; for the second equality use 
34a) ) © (54a) 9470541) ana sum by parts. (4) 
sum in part b can be changed to j to N; then let k=#i-j and use 

result that (2j+k) ‘2J) = a(2j+ky (2541) (2541). (ey) In (6.4-21) 


w= and terms for s and -s cancel. (f) From (6.4-20) 


Show limits in 


2 
Bs =Q. 5°5- 1/ (a5 _ 1€5)° 
x = .3 x = .4 x = .5 x= .6 x = .7 
(a) 3.1193 4.2952 5.7495 7.5091 9.5786 
(b) 3.1174 4.2965 5.7509 7.5046 9.5814 
(c) 3.1180 4.2960 5.7500 7.5040 9.5820 
(a) x=.18s+.7; by=-- 3919, b,=.9478, bo=.4202, b,=.0238, b,=-.0001; 
then calculating a” get M=3 as best least-squares approximation. 
(b) xX 4 5 .6 .7 
, -4 -4 -5 -4 
Residual: -2.0x10 3.7x10 -4,.7x10 -3.4x10 
8 9 1.0 
1.9x1074 2.3x107" = =-1.0x1074 
-4 -4 -4 
(c) Trpo * 1.4x10 §, Trp ® 2.1x10 =, Sapo * 2.0x10 =, 
-4 
SAb3 * 1.5x10 §; T,(s) = -024p, (s) + -420p, (s) + -948p, (s) - -392p) 


therefore, all errors within expected limits. 


(s); 
(a) Py=t, Py=8/4, py= (387-20) /28, p= (5s°-59s) /84, 
2 


py= (78-1158 +216) /168, pe™(98°-1858°+716s) /240. 


(b) b,=6.97863, b,=2.53581, 


0 
be=.00003. (c) 


bo=.76201, b,=.08379, b,=.00240, 


3 
In powers of x, Yy (%) = 1.0000x'+2.9875x°+ 


2.0188x~+.9915x+5.00103 differences all explainable by roundoff. 


(a) If j = r (modN) each term is 1; otherwise see prob. 29. 
(c) For convolution use 
<7 ‘ “ ; 
a b, = a_b 
3-0 J 3z0 J 520 reo * JF 
(a) Only one non-zero term in summations. (b) G-c > g,~c6 (k). 
-1 
(c) g,=(1/N) fF G 
0 K=0 k 


(b) Let H =G rh 


Use convolution theorem. (c) 4 ™G), (hy =, « 
26.


27. 


28. 


30. 
31. 


32. 


33. 


34. 


35. 


36. 


37. 


(a) Show minimum of $r/\log_2 r$ is for r=e; test 2 and 3.
(b) Show values of Je fill in gaps of 2h=1 between sum of other 2 


terms. (c) Sines and cosines in all 4 quadrants can be expressed 
in terms of sines in first quadrant. 
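Part (a) of the problem above compares radices through the quantity $r/\log_2 r$. A short numerical check follows, assuming this is the operation-count factor being minimized; the function name is illustrative.

```python
import math

def cost(r):
    """Operation-count factor r / log2(r) for a radix-r factorization."""
    return r / math.log2(r)

# Compare the integer radices nearest to e.
costs = {r: cost(r) for r in (2, 3, 4, 5)}
best = min(costs, key=costs.get)
```

Among integer radices 3 gives the smallest value, while 2 and 4 tie; the continuous minimum lies at r=e.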


t-L+1 t-£ 
Jp: q=pt2” “; f, (p)=f£,_, (qa); £,(q) = 


Kpot-1 
(£,_,(P)-£,_, (a)lw » (c) Replace w” 


(b) p=K , +2 


by cos(2ly/N) everywhere. 


(a) $H_0=4$, $H_1=3$, $H_2=2$, $H_3=1$. (b) Use (6.6-25), (6.6-26) with
negative exponents on w and multiply by 1/8; get $g_0=1/2$, $g_2=g_4=g_6=0$,
$g_1=1/8-(1+\sqrt2)i/8$, $g_3=1/8+(1-\sqrt2)i/8$, $g_5=1/8-(1-\sqrt2)i/8$,
$g_7=1/8+(1+\sqrt2)i/8$. (c) By exact calculation $H_0=4$, $H_1=3$, $H_2=2$, $H_3=1$, $H_4=0$,
$H_5=1$, $H_6=2$, $H_7=3$. (d) Use fact that g_,=g, and note that sum in
part a could go from 0 to 7 with G5.j=4,5,6,7 defined as in part b; 


then result follows directly from problem 23b. 
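The numbers quoted in this answer are consistent with data $G_j=1$ for j=0,...,3 padded with zeros to length 8, with $H_k$ the cyclic lagged products of that sequence. The sketch below assumes this reading, which is inferred from the stated values rather than given in the hint. It uses a direct O(N^2) transform instead of a fast one for brevity, and the name `dft` is illustrative.

```python
import cmath

def dft(values, sign):
    """Unnormalized discrete Fourier sum with exponent sign*2*pi*i*j*k/N."""
    n = len(values)
    return [sum(v * cmath.exp(sign * 2j * cmath.pi * j * k / n)
                for j, v in enumerate(values))
            for k in range(n)]

G = [1, 1, 1, 1, 0, 0, 0, 0]                   # assumed data, zero-padded to length 8
N = len(G)
g = [c / N for c in dft(G, -1)]                # negative exponents, multiplied by 1/8
power = [abs(c) ** 2 for c in g]               # |g_k|^2
H = [round((N * c).real) for c in dft(power, +1)]   # cyclic lagged products via transform

# Direct computation of the cyclic lagged products for comparison.
H_direct = [sum(G[j] * G[(j + m) % N] for j in range(N)) for m in range(N)]
```

The transformed and direct results agree, which is the content of the convolution argument cited at the end of the answer.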

(a) Use formula for sum of geometric series; (b) take real and 
imaginary parts. 

(a) In (6.6-29) and (6.6-30) replace (2L+1)/2 by L and 2L+1 by 2L. 
(b) In (6.6-33), (6.6-34), and (6.6-35) replace 2L+1 by 2L and 2L 
by 2L-1. (c) a,=(1/L) ££; cos (ijm/L); sin Lx=0 at all Xi 


(a) Use identity $\sin nx = 2\sin(n-1)x\,\cos x - \sin(n-2)x$.


(b) Uy 5 AE yt2e08x Ui 44 3 Yn? 5° 
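The identity in part (a) leads to a backward recurrence for summing a sine series without evaluating each $\sin jx$ separately. Part (b) is too garbled to confirm its exact form, so the sketch below uses the standard recurrence $u_j=b_j+2\cos x\,u_{j+1}-u_{j+2}$ with $u_{n+1}=u_{n+2}=0$, for which the sum equals $u_1\sin x$. Treat the correspondence with the book's notation as an assumption.

```python
import math

def sine_series(b, x):
    """Evaluate sum_{j=1}^{n} b[j-1] * sin(j*x) using the backward recurrence."""
    u1 = u2 = 0.0                       # u_{j+1} and u_{j+2}
    c = 2.0 * math.cos(x)
    for bj in reversed(b):
        u1, u2 = bj + c * u1 - u2, u1   # u_j = b_j + 2 cos(x) u_{j+1} - u_{j+2}
    return u1 * math.sin(x)

coeffs = [0.5, -1.25, 2.0, 0.75]
x = 0.9
direct = sum(bj * math.sin((j + 1) * x) for j, bj in enumerate(coeffs))
```

Only one sine and one cosine are evaluated, regardless of the number of terms.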


L 
(b) Consider sin 5 (x, -x) (5 + } cos J(x,-x], multiply out and get 
telescoping series. j=1 

2L-1 1 1 1 

(c) yy (x)= (1/22) 2, [5 sin L(x,-x)cos 5 (x,-x)/sin 5 (x, -x]f;. 
(a) Show (6.6-44) is exact when f(x)=1, write (6.6-44) for f(x)=1, 
multiply by f(x), and subtract from (6.6-44) as in text. (b) For 
x, near x denominator terms are smallest; sin (L+1/2) (x, -x) 


oscillates rapidly; if f(x) changes slowly there is cancellation 
for those x, not near x; even for rapidly changing f(x) terms with 


x, near x are most important. 


2 
(a) aj=(1/t) J, f(x) cos jx dx; use h=#24/(2L+1) and note that 


f(0)=f(2"); similarly for bs. (b) All derivative terms drop out 


because of periodicity, but this does not mean that error term is 
zero. 
(a) aj=-0003, a,=.0002, a,=.0000, a,=1.0002, b,=1.0000, bo=-.0002, 


b,=.0001.  (b) -1.5x1074, 2.8x107/, 1.5x107%, -1.6"107 4, 
-2.8%107°, 6.8x10->. (c) 1.69%107’, -6.6"107"; roundoff causes 


negative result from second form of (6.6-35). 
2L oo 2L ad 2L 
2 2L+1 . 
a, =) £, = A, + } A, } cos jx, + } B. J sin 
aL+T "0 iso i 2080" 521 4 iso 1 j=1 7) 40 
jx; i then use result of Prob. 30c. (b) Again use Prob. 30c. 
38. (b) ce, =(1/22) pe F, (ye OT /2) a); then use expression for £(t)
-2Q 
2miAX5) 


(d) Substitute series for F(A) in part b into inverse transform, 
interchange summation and integration and then use part c. 


as inverse Fourier transform. (c) Calculate _ P(iA)e 
CHAPTER 7 


2. 


11. 


12. 


(a) Since a=0, consider qd, (2) /2- (b) Use q.(2)/(27+y") . 
(c) Show that on diameter qn (2) crosses real and imaginary axes 


appropriate number of times. (d) If not g,h have four roots 

tatbi with a,b#0. 

(a) Use y=x-a where a is common real part. (b) Sum of roots 
nonzero if not all have same real part. (c) Unless u(y) is 


identically zero, the roots can be shifted in the other direction 
or further in the same direction to give a,=0, B,=0 and b 170. 


(b) Form of v in (7.2-15) is y*+y+a when n is even. (c) At 
most 2 applications needed to achieve u, Cy) of odd degree. 


(a) Case (ii) choice of &=3 leads to bo=0; choose instead 6=2 or 4; 
with 6=2, a,=1, a5=1/3, a,=6, Bo=74/9, B,=16/3,; with 6=4, a,=9, 
a5=-3, a,=-6, Bo=-26, B,=12. (b) 6=7/6 for polynomial of part a, 


6=1/5 for polynomial of Example 7.1. 

(a) p=1/2(A-1); a=B'-2q with B'=B-p(p+1); r+s=#C'-q with C'=C-pB'; 
r=q7+D'q+D!! with D'=-B'+p, D''=D-pcC'; ~2q?-q? (1-B'+2D') -q (-B'D'+ 
2D''-C')+B'D' '-E=0. (b) Express P(x) in terms of a,b,c,d,e,f and 
match coefficients using equations of part a. (c) y=x(x+3), 

w= (y-7) (x+3) , P=(wty+15) (w+16)-27; this solution corresponds to root 


q=2 of 2q7?-8q7+2q+12=0; corresponding to two other real roots there 
are two other solutions. 


Show valid for n = 2^k for any k; then for 2^k < n < 2^(k+1) use induction;
to get code word for r+1 from that for r start from right changing
each SX to S until come to first S; change this to SX; then show
new code word results in one higher power of x than old.
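
The increment rule in this hint can be tried out with a short sketch. It assumes the usual reading of the code word (S = square the running result, X = multiply it by x, starting from x itself) and builds the word from the binary digits of n. The function names are illustrative, not from the text, and the handling of a word with no isolated S (n = 2^(k+1) - 1) is an assumption added so the rule is defined for every n.

```python
def code_word(n):
    """S/X code word for x**n, n >= 1 (S = square, X = multiply by x)."""
    bits = bin(n)[3:]  # binary digits of n after the leading 1
    return "".join("SX" if b == "1" else "S" for b in bits)


def power_from_code(x, word):
    """Evaluate the code word, starting from x itself."""
    result = x
    for op in word:
        result = result * result if op == "S" else result * x
    return result


def next_code_word(word):
    """Code word for r+1 from the code word for r, following the hint."""
    tokens, i = [], 0
    while i < len(word):           # split the word into 'SX' and 'S' tokens
        if word[i:i + 2] == "SX":
            tokens.append("SX")
            i += 2
        else:
            tokens.append("S")
            i += 1
    j = len(tokens) - 1
    while j >= 0 and tokens[j] == "SX":   # from the right: SX -> S
        tokens[j] = "S"
        j -= 1
    if j >= 0:
        tokens[j] = "SX"                  # first S met becomes SX
        return "".join(tokens)
    # Assumed edge case: every token was SX, so r + 1 is a power of 2.
    return "S" * (len(tokens) + 1)
```

For example, `code_word(5)` is `"SSX"` (x, x^2, x^4, x^5) and `next_code_word("SSX")` is `"SXS"`, the code word for 6.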


By long division; then let Q4=by_5/P, 3 calculate 85's from 


k 
oP ase i-172r" r=m,...,k+1 with s5=0, j > m-k-1; calculate p;‘s 
k 
from a b,S 5 _4tPy- tap ra0,-+- rk with 8;=0, j < 0. (b) Dy=Pg. 
C4y=Py-P 944? degenerate case if C,=0. (c) 54.4 tP 5H dy JE1,..6, K-11; 
Py Gy? PoI5417PoP5* (Py -Po9 dy 45=P5 ,j=1 Geeoe ,k-1 ; (Py -P 944) 9, +P Py =P} i 
calculate Pye dye Pri! Wen! etc. 
m-k=~1 j 
(a) In place of Cx have x ( } s.x?). (c) m > k+1: k divisions, 
j=0 
m-k multiplications; m < k: m+1 divisions, k-m multiplications. 


(d) 24+3/[x-14+2/(x-2)]. 
(b) (7.3-5) with s=0 is cobytc,b,=0 but c,=0 and bo=1, Co=-1/2. 


Consider £(x)- [Pq )'%) (x)-+Dxegty 1) (x 1/ fay 1) 


pxQims tk) (xy) and show that D can be chosen so that this is 


» peg (M-1,k) 74 (m-1,k-1) | 3 (m-k-1) ,,(m-1,k-1) 
£(x)-Riy (x); Ded, /a,, ; E=dy. Jay . 


2 3 2 3 - 
(b) R33 (z)=(a ta, ztayz +a,z )/(1+b,2+boz +b,z ) where ag=l, 


a47-3665*10!/A,a,=19 ,197%x8!/A,a,=-14 ,615x6!/A,b»=229x101/A, 
b5=297x81/A,b,=127%61/A with A=59x12!. 


(x) + 
13. 


14. 


15. 


16. 


17. 


18. 


19. 


20. 


21. 


(c) Do = -14,615/127 =~ -115.1; rest of continued fraction must add 
to Do to get result less than 1 in magnitude. (d) do#45 ,469/ 


(38,940x14!). 
(a) b,=10! (229-768) /A, bo=8! (297-428) /d, b,=127%6!, 


a,=6!(-14,615+16,632&)/d where A=12! (59-34). 


(b) Cy = -E, Dy = (-2352E°+33,264E-14,615) /127. 
(c) do= (45 ,469+9336E) /[59-34E) 660x148). (da) E€ = +.45, +13.69; 
latter value gives smaller value of d.. 


(a) For the interior arguments use induction. (b) Any permutation 
of arguments can be changed back to Xqreeer Xy by interchanging 


interior arguments, end arguments, and first two arguments. 
(b) 
  x      f(x)       rho_1      rho_2      rho_3      rho_4      rho_5      rho_6
 .3   -1.203973
                   .347606
 .4    -.916291              1.073065
                   .448141              1.769896
 .5    -.693147              1.300036              2.288830
                   .548483              2.174429              2.535369
 .6    -.510826              1.484544              2.469300              3.116367
                   .648715              2.580621              2.923609
 .7    -.356675              1.639831              2.640794
                   .748890              2.980236
 .8    -.223144              1.774279
                   .849019
 .9    -.105361
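
The table entries can be reproduced with a short sketch of the reciprocal-difference recurrence, rho_k(x_i,...,x_{i+k}) = (x_i - x_{i+k}) / [rho_{k-1}(x_i,...,x_{i+k-1}) - rho_{k-1}(x_{i+1},...,x_{i+k})] + rho_{k-2}(x_{i+1},...,x_{i+k-1}), with rho_0 = f. The function name is illustrative, and the arguments .3, .4, ..., .9 are read off the damaged first column of the table (the function values agree with ln x at those points), so treat them as an assumption.

```python
def reciprocal_differences(xs, fs):
    """Return columns rho[k][i] = rho_k(x_i, ..., x_{i+k}), k = 0..n-1."""
    n = len(xs)
    rho = [list(fs)]                      # rho_0 is the function itself
    for k in range(1, n):
        col = []
        for i in range(n - k):
            denom = rho[k - 1][i] - rho[k - 1][i + 1]
            value = (xs[i] - xs[i + k]) / denom
            if k >= 2:                    # add back rho_{k-2} of the interior arguments
                value += rho[k - 2][i + 1]
            col.append(value)
        rho.append(col)
    return rho


# Six-digit function values as tabulated above.
xs = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
fs = [-1.203973, -0.916291, -0.693147, -0.510826,
      -0.356675, -0.223144, -0.105361]
rho = reciprocal_differences(xs, fs)
```

Because the recurrence divides by small differences, the higher-order columns are sensitive to the rounding of the tabulated function values.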

(a) f(x) =£(x,)+(x-x,)/04 (XX) H£(X,) + (x- x1) J 10 (X41 XQ) + (K-KQ) / 

[05 (XX 0X5) -Op (x4) J etc. (b) When X=X; continued fraction from 
X-x, on is zero. (c) Convergents are -.693147, -.620219, 


-.616209; true value -.616186. 
(a) From definition of 9, (X15). (b) Last equality follows from 


definition of partial derivative. (c) 1/(8/8x) O54 (Xre +X) = 
(179) /(A/Ax) 044 (Xs ++ 2X) = JR{R;_ F(x). 

(a) Use Prob. 17c. (b) ro, (0)=(2n+1) ; Fong (0) =2/(nt1) ; conver- 
gents of ln .54: -.46, -.597403, -.612593, -.615714, (c) ro, (0)5 
(-1)9*1o, ¢ ,(0)=(-1)"(2n+1); convergents of e: 1,2,3,2.75, 
2.71143. 

(a) Show 97 (2) y(z) has degree n-1; show F(z) has n+1 zeros; take 


2n+ 


nth derivative of F(z) and use Rolle's theorem. 
(b) p(x) + (x-a)"; rest is unchanged. (c) In both cases approxima- 


tion has same functional value and same first n-1 derivatives as 
f(x) at x=0. 
(b) (m,k):  (4,0)   (3,1)    (2,2)   (1,3)    (0,4)
    d:      1/120   -1/480   1/720   -1/480   1/120
(a) sin x: R_{5,0}(x) = x - x^3/6 + x^5/120, R_{3,2}(x) = (x - 7x^3/60)/(1 + x^2/20),
R_{1,4}(x) = x/(1 + x^2/6 + 7x^4/360); cos x: R_{4,0}(x) = 1 - x^2/2 + x^4/24,
R_{2,2}(x) = (1 - 5x^2/12)/(1 + x^2/12), R_{0,4}(x) = 1/(1 + x^2/2 + 5x^4/24).
(b) Take remainder y of x/2π; if y ∈ [π, 2π] consider y - 2π; then
reduce to interval [-(π/2), π/2]; if magnitude of y less than π/4
compute sin y; otherwise compute cos (π/2 - y); then adjust sign;
algorithm for cosine similar; the Padé approximation is more
accurate the smaller the argument. (c) Magnitude of errors at
±π/4: R_{5,0}(x): 3.6×10^-5; R_{3,2}(x): 3.8×10^-5; R_{1,4}(x): 3.3×10^-4;
R_{4,0}(x): 3.2×10^-4; R_{2,2}(x): 4.5×10^-4; R_{0,4}(x): 1.4×10^-2.
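
The error magnitudes quoted for part (c) can be checked numerically. The sketch below evaluates the standard Padé approximants of sin x and cos x of total degree 5 and 4 at x = π/4; the function names and the dictionary layout are illustrative only.

```python
import math


def sin_r50(x):
    return x - x**3 / 6 + x**5 / 120


def sin_r32(x):
    return (x - 7 * x**3 / 60) / (1 + x**2 / 20)


def sin_r14(x):
    return x / (1 + x**2 / 6 + 7 * x**4 / 360)


def cos_r40(x):
    return 1 - x**2 / 2 + x**4 / 24


def cos_r22(x):
    return (1 - 5 * x**2 / 12) / (1 + x**2 / 12)


def cos_r04(x):
    return 1 / (1 + x**2 / 2 + 5 * x**4 / 24)


x = math.pi / 4
errors = {
    "sin R50": abs(math.sin(x) - sin_r50(x)),
    "sin R32": abs(math.sin(x) - sin_r32(x)),
    "sin R14": abs(math.sin(x) - sin_r14(x)),
    "cos R40": abs(math.cos(x) - cos_r40(x)),
    "cos R22": abs(math.cos(x) - cos_r22(x)),
    "cos R04": abs(math.cos(x) - cos_r04(x)),
}
for name, err in errors.items():
    print(f"{name}: {err:.1e}")
```

The approximants with most of the degree in the numerator come out noticeably more accurate here than those with most of the degree in the denominator.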


22. (a) Ro a (x) =14127 (x-6412/x) . 
(b) Ryo =.83,1 22,2 21,3 Ro,a 


Cont. fract. (M/D): 3/0 2/1 0/2 2/2 3/1 
Rat. fnct. (M/D) : 3/0 3/1 2/1 3/1 3/1 


(c) (i) Ro 2 by continued fraction; (ii) Ro 2 by continued 


fraction or R,, 0? (iii) Ry 0° 
23. (a) Use y = 1/2(x+1); T*_r(x) = T_r(2x-1) = cos rθ. (b) T*_{r+1}(x) = (4x-2)T*_r(x)
- T*_{r-1}(x); T*_0(x) = 1; T*_1(x) = 2x-1; T*_2(x) = 8x^2-8x+1; T*_3(x) = 32x^3-48x^2+
18x-1; T*_4(x) = 128x^4-256x^3+160x^2-32x+1; T*_5(x) = 512x^5-1280x^4+1120x^3-
400x^2+50x-1; T*_6(x) = 2048x^6-6144x^5+6912x^4-3584x^3+840x^2-72x+1.
(c) T*_r(x)/2^(2r-1) has smallest magnitude on (0,1) of all polynomials
of degree r with leading coefficient 1; proof same as for Theorem
7.1.
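
The coefficients in part (b) follow mechanically from the three-term recurrence T*_{r+1}(x) = (4x-2)T*_r(x) - T*_{r-1}(x) with T*_0 = 1 and T*_1 = 2x-1. A minimal sketch (the function name and the constant-term-first coefficient layout are choices made here, not the book's):

```python
def shifted_chebyshev(r_max):
    """Coefficient lists (constant term first) of T*_0, ..., T*_{r_max}."""
    polys = [[1], [-1, 2]]
    for _ in range(2, r_max + 1):
        prev, cur = polys[-2], polys[-1]
        nxt = [0] * (len(cur) + 1)
        for i, c in enumerate(cur):
            nxt[i] += -2 * c       # contribution of -2 * T*_r
            nxt[i + 1] += 4 * c    # contribution of 4x * T*_r
        for i, c in enumerate(prev):
            nxt[i] -= c            # subtract T*_{r-1}
        polys.append(nxt)
    return polys[: r_max + 1]


for r, coeffs in enumerate(shifted_chebyshev(6)):
    print(r, coeffs)
```

The leading coefficient of T*_r is 2^(2r-1) for r >= 1, which is the normalization used in part (c), and every T*_r equals 1 at x = 1, a quick check on any hand-copied coefficient list.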

24. (a) Use recurrence relation and induction on i; consider only even 
i-j}. (b) Proceed as in part a. 

25. (a) Use (7.5-1). _ 


26. (a) c_0 ≈ 2.532132; c_1 ≈ 1.130318; c_2 ≈ .271498; c_3 ≈ .044382;
c_4 ≈ .006017. (b) Coefficients: 1.000584; .997172; .494859;


4 
.177528; .048137; maximum error on [-1,1]: -1.0x1072 at -1. 


27. (a) Use change of variable x = sin θ and integrate term by term.
(b) c_0 ≈ 2.532118, c_1 ≈ 1.130317, c_2 ≈ .271484, c_3 ≈ .044336,
c_4 ≈ .005469. (c) Coefficients: 1.000044; .997309; .499216;
.177344; .043752; maximum error on [-1,1]: 6.2x10-" at +1. 


; = - k e e f—4 . 
28. (a) For sin x: Co, 20; Cong =e 1) Jog 1) 3 for cos x: Conny 0: 
Cy =2(-1) 855, (1): sin x # .999980x-. 166504x°+.008000x°; cos X zx 


.999958-. 499244x~+.039632x' smaximum error on [-(1/4),7/4]: sin x: 


2.8x107°; cos x: 4.2*107>. (b) T, > (x) =[.890848T, (x)-.027874T, 


(x) ]/ [14 0255567, (x) ]=(1.000027x-.114420x7) / (14.052452x7) ; 


maximum error on [-(1/4) ,17/4]: 3.9x107°, (c) T. 2 (x) =[.760245- 


1967147, (x) 1/[14.0431077, (x) }=(1.000069-.411152x7) /(1+.090098x~), 


maximum error on [-(1/4,17/4]: 6.9x107>, 


29. T; 1 (X) = (9998774, 75669 3x+. 25644 2x 7+. 042212x°) / (1-.243945x) ; 
a 


yy 


maximum error on [-1,1]: 3.1x10 =; T, 3 (x)= (1.000068+.247144x) / 


(1-.753525x+.252980x--.040644x-); maximum error on [-1,1}:4.2«1074. 


_ _ N+2 N+1 
30. (a) Cig (X)ERyy 2 9 (*) dg 20 Ty42 (X78) /2 . 
31. 


32. 


33. 


34, 


35. 


36. 
37. 


(b) C_{5,0}(x) = .999995x - .166601x^3 + .008119x^5; maximum error on
[-(π/4), π/4]: 6.3×10^-7. (c) C_{4,0}(x) = .999990 - .499703x^2 + .040382x^4;
maximum error on [-(π/4), π/4]: 1.1×10^-5.

(a) Express numerator of £(x)-C_) (x) as sum of Q, (x) £ (x) -P. (x) 
and use analog of (7.7-9) with Q,(0)=1. (b) (1)C, 5 (x) = (228141128x+ 


180x7) /(2281-1152x+192x7) ; maximum error on [-1,1]: 1.4x1072 at +1. 


(ii) cy 5 (X) = (228141 152x+192x7) /(2281-1128x+180x7) ; maximum error 
3 


on [-1,1]: -1.1*%10 > at +1; choice of second and fourth rational 
functions doesn't matter since Y5=Y4=0- 


(a) Using Ro ,0° Ri 0° Ry 4° Roa? R34: Cay (x) = (19874112 8x4+ 


384x°4+64x>) /(1487-360x) ; using R, a" R 1: Cc, 3 6%) = (1487+360x) / 


2 


(1487-1128x+384x ~64x°) ; maximum errors on [-1,1]: C3 ee 


524x107; C, 3 (x): 1.9x107>, (b) Only requires computation of 
Y,7 not necessarily equal to C, 5 (x). (c) XC {= (144) x 
e v 


7x°/60}/(1+y ,+x°/20) where ¥4=30'/716 , 800 ; maximum error on 


[= (0/4) ,w/4]z 8.3x1078. (a) Cy (x2) = (14g 527/12) / (14 tx2/12) 


where Y=" /81,920; maximum error on [-(1/4) ,w/4): 1.2x10-/," 


(a) y=x-.505; (y) =(.710634+.831642y-.200409y*) /(1+. 180 184y+ 


Ro 2 
.029733y*) ; Maximum error for x € [.01,1]: -.172 at .01. 

(b) Cop (v)=(1.0435741 .31879y-. 2004 1y) /(1, 468504. 40869¥+.02973y°) ; 
maximum error for x e€ [.01,1]: —.168 at .01. 

(a) Use shifted Chebyshev polynomials. (b) Chg (x) =(1.211504+ 


-646267x+.125000%7) /(1.211507-.565104x+ .083333x~) ; maximum error on 


[0,1]: 1.2x10 at 1. 


(a) From differential equation get D+1 equations; from initial
conditions n equations. (b) D-m+n τ_r's give D+n+1 total unknowns.
(c) Using L(y) = y' - y, y(0) = 1 get y(x) = 1 + 1824x/1825 + 928x^2/1825 +
256x^3/1825 + 128x^4/1825; maximum error on [0,1]: 9.9×10^-5.
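
The quartic in part (c) can be checked directly. The sketch below assumes the answer is the polynomial with coefficients 1, 1824/1825, 928/1825, 256/1825, 128/1825, confirms that its residual y' - y is a multiple of the shifted Chebyshev polynomial T*_4(x) = 128x^4 - 256x^3 + 160x^2 - 32x + 1, and samples the error against e^x on [0,1]. The helper names are illustrative.

```python
import math

# Coefficients of y(x), constant term first.
y = [1, 1824 / 1825, 928 / 1825, 256 / 1825, 128 / 1825]


def polyval(coeffs, x):
    return sum(c * x**k for k, c in enumerate(coeffs))


def residual(x):
    """y'(x) - y(x) for the polynomial above."""
    dy = sum(k * c * x**(k - 1) for k, c in enumerate(y) if k > 0)
    return dy - polyval(y, x)


def t4_star(x):
    return 128 * x**4 - 256 * x**3 + 160 * x**2 - 32 * x + 1


# Sample the error y(x) - exp(x) on a grid of 1001 points in [0, 1].
max_error = max(abs(polyval(y, i / 1000) - math.exp(i / 1000))
                for i in range(1001))
print(f"max error on [0,1] ~ {max_error:.2e}")
```

On this grid the residual equals -T*_4(x)/1825 to rounding, so the perturbation term has τ = -1/1825.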


Use partial fractions. 
(a) Choose Re? (x) such that ri artt/i. (b) Let PRE) (x) | < G 
on [a,b]; choose any m+1 points Gyre ee by in (a,b); then 


|P(e, Y/Q(E; i)| < G + max f(x); since Q normalized [P(es) | is 
{a,b} 
bounded; polynomial of degree m bounded at m+1 fixed points is such 


that all its coefficients are bounded; therefore, convergent sub- 
sequence exists. (c) R(x)= =£ (x) +R!) (4¢)-£ (x) +R (x) -R'?) (x); there- 
fore, |R| < f +[R) -e/4;per 1, R=lim R'?) except perhaps at 


io 
finite number of points where denominator is zero; except at these 
points |R| < max ftr, te, where €. 470 as i+~; therefore, |R(x)| is 


bounded except at a finite number of points: therefore, R'2) (x) 
converges uniformly to R. (d) max |f-R| < max | £-R, | +max [R, -R| <r, 
therefore, r < r but r > also since r is assumed minimum. 


38. (a) Since A 
. (2) 
(b) Use properties of R to deduce that A(a,_,) and A(Q; 44) 


have same sign if k is even, different signs if k odd. 
(c) Apply result of part b. (d) Show degree of numerator of A 
no greater than Lo-2. (e) Suppose contrary, consider Sak 6%) “Ray (%) 


with a common denominator and count zeros of numerator. 


< 
= ce-r{2)y-e-R and a:s are extrema of f-R (2) 


39. (a) Each step of each proof is unchanged by the introduction of 
s(x). (b) Multiply rational functions in (7.9-2) and (7.9-3) by 
s(x). 


40. Best constant approximation is 1/2; show that no ratio of linear 
approximations could be better. 
41. (a) R_{1,1}(x) = (1 + x/2)/(1 - x/2). (b) C_{1,1}(x) = (17/16 + x/2)/
(17/16 - x/2); extrema at -1.0, -.43, .59, 1.0. (c) R*_{1,1}(x) ≈
(1.017024 + .517541x)/(1.0 - .439785x) after two iterations; maximum
error .020972.
CHAPTER 8 


1. 
2. 


10. 


11. 


12. 


13. 


14, 


15. 
16. 


(b) g' = 1/y'; g'' = -y''/(y')^3; g''' = [-y'''y' + 3(y'')^2]/(y')^5.
(a) If C > 1 then a small error would tend to grow if order is 1; 


if order greater than 1 then le. 44! rw cle, |P will be smaller than 
le. | if le, | is sufficiently small even if C > 1; need “or equal" 
part for case where Esyqrii-tile,. (b) No; for any C, 


le. 44! > cle, |P if p < 1 and le, | sufficiently small. 


Set S,=p" in homogeneous part; get P=P;; then get particular 


solution. ' 

(a) 25 for f(x); 5 more for f'(x); if evaluations of elementary
functions are saved f'(x) will usually be easy to calculate.

(b) Since evaluation of f(x) is of no use for f'(x) the relative
costs are (n-1)/(n-2).

(a) Assume contrary and use fact that sequence of iterates is 
bounded. (b) Use a geometric argument. 


(b) Plausibility because Youd * (1 + V5)y¥3/2. 
(a) (i) X= 6517747; (ii) X¢ 515201; (iii) Xe=-517744; 
(iv) X_=.517756; true value: .51775737. (b) X= 85705674; 


true value: .85705677. (c) Equation in y is y cot y=ln y-ln sin y; 
with Y,=1, Yo=2, Y7=1. 3372355, true value: 1.3372356, true value 


of x is .3181316. 


t c2ctt ',3 
egg eXy E/E pf pf, /2(E i) 
1/146, +6, 
(c) 3 ; good method when cost of computing second 
derivative is small. 
(a) Choose a,b so that x, -u,/Q(u,) subtracted from right hand side 


of answer to 8(b) agrees up to powers of uy? Q(u,)=1-u,f;'/2£,. 
(b) Halley: %4~-1.4142139; Problem 8b: X4=1.4142166. 
(a) E(x, E(x) + (K5 nx) E(x). (b) f(a)=f(x;)+(a-x,)f (x5) + 


(b) 


1/2(a-x,)7£"" (E). 


1+75)/2, 


(a) CysC£"" (a) /2£" (a) | Co=lf''(a)/2£'(a)}. (b) Get 


[ (1475) /2} 7227; r/seln 2/1n [(14+75) /2]. 

(a) (i) Xe=e51775740; (ii) Xy=~51775737. (b) Xg=- 85705679; 
X3=-85705679; for faster convergence in both cases try X4=.25. 
(a) Newton: x_4 = 1.5882040; Halley: x_4 = 1.5849805; true value:
1.5848933. (b) Use f(x) = 1/x - a for reciprocal of a;
x_4 = .0999966; Halley's method can't be used because it needs
division. (c) Draw the graph of 1/x - 10.
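
Applying Newton-Raphson to f(x) = 1/x - a gives x_{i+1} = x_i - f(x_i)/f'(x_i) = x_i + x_i^2 (1/x_i - a) = x_i (2 - a x_i), which uses no division. A minimal sketch follows; the function name and the starting values are assumptions for illustration and are not the ones used to obtain the quoted x_4.

```python
def reciprocal_newton(a, x0, steps):
    """Iterate x <- x * (2 - a * x), the Newton step for f(x) = 1/x - a."""
    x = x0
    history = [x]
    for _ in range(steps):
        x = x * (2 - a * x)
        history.append(x)
    return history


# Hypothetical starting values for a = 10.
good = reciprocal_newton(10, 0.05, 4)   # 0 < x0 < 2/a: converges to 0.1
bad = reciprocal_newton(10, 0.25, 6)    # x0 > 2/a: iterates become negative
print(good[-1], bad[-1])
```

Since 1 - a x_{i+1} = (1 - a x_i)^2, the iteration converges exactly when |1 - a x_0| < 1, that is for 0 < x_0 < 2/a, which is the region the graph of 1/x - 10 in part (c) makes visible.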
(a) Use Theorem 8.1; 69) (ay=0, j=1,..., p-1 if U(a) is bounded; 
G'P) (a)¥0 if U(a)x-F 'P) (a) /pt. 
' 2 
gait li-cft (x,)]e,+0(e,). 
(a) Use second equation to compute (x, 


Show e 
t 

gar DE (x; )/(y;-a). (b) Use 

first equation and error in Newton-Raphson method. (c) Order is 

3; to get from x; to X44 need two function evaluations, one 

derivative; more efficient than Newton-Raphson if 6,<.71; more 


efficient than secant if 6,<.28. (d) Xy=-51775738. 




17. 


18. 


19. 


20. 
21. 


22. 
23. 


24. 
25. 


26. 
27. 


28. 


29. 


30. 


31. 


(a) See Prob. 43, Chap. 4. (b) Solve y_{i+1} = y_i + y_{i-1} + y_{i-2},
y_0 = y_1 = y_2 = 1; positive root of r^3 - r^2 - r - 1 = 0 is 1.839. (c) 2.732.


(a) x54 = CE (yg) £ Oey 9) JC LE (oe, -£ (oe; _ 4) DE Oe) Oe; 9) 
+(£ (x; ) £ (x; yarn 1) ~£ (x5) 1 CE Ox 4) “£ (x; D1} x, 1 
+(£ (x, ) £ (x, 4) / CLE (x, - € (x, TECK, _2)-f£ (x; _,) 13) x;_>- 
)} 2x. 


(b) i-1 i 


Keg Tlt/ Le (x; )-£ (x5) ] PH 13E (x, )-£ Ox, pI (x; 
~[3£ (x; _ 4) -£ (x, )) [£057] x, _ Fe CLE (xg) £0244 4) 1 LE Org) £05) 17 
[£(x, _)/t" (x. )+E (x, Ee" (x, 1)J- (c) Part a with x_470, Xoul, 
x,=2, get x= 451775736; part. b with X90, X4=1, get Xe=.51775733. 
(a) Since f' (x)#0, f(x) has different signs at Xpone Xi 
therefore, polynomial which goes through £ (xi 5). j=0,...,n must 


have zero in interval but need not be unique. (b) Set x=a in 

(8.5-1) and use mean-value theorem. 

(a) Use a two-point Lagrangian formula; (b) Use a one-point 

Hermite formula. 

(a) In three-point Lagrangian formula set x=x, i+] and solve for 


(x5 447%) 7 (x, -x3_4)- (ec) Same difference equation as in Prob. 17b; 
order is 1.84, (d) With x_,=0, x 9=1, x,=2, get x_=.51775738. 
(c) (8.5-25) is of form (8.5-8) with n=2, r,=ro=1. 

] t 
(a) x, ie17% ivf /t. where f. pm (2%, - Xs _ yy 9) Fy (X53 -% yy) (% XG) + 
(Xp—Xs a) Ep gH ey yoy) Oy Xi) One) £5 Oy _ 97%) 
(x X97 XL 1° (b) If Fy is Newton-Raphson and Fe is function of 

te eee 

part 3 then €,,,=a-F,=1/2f (é)e; */£; ~£,£, 0 (%,-x,_ 4) (XE-X_ 9) 
6f.f.; dominant term is in e. i£4-1£4-2° 
(c) With x_,=0, Xoely x y72- get X9=.51775738; no square roots 


needed. 
(c) Use induction. (ad) If r#1, (1/r-1) 4470. 


(a) With m=0 in (8.6-10) get e€ ~ (1-1/r)e,. (b) Xpore5l21, 
Xo97-51766. 

(a) X4=-51775737; (b) Xe=-51775737. 

(a) £(x) has same sign on both sides of root. (b) Suppose Xo and 


are such that £(xp)=£(x,). 


“xX. 


1 


i+1 


x4 


2 . . 
Ki gerXy/ 1-7 therefore x <0, i=2,3,...3 but magnitude of Xi 44 
less than that of Xj° 
(a) X54 Xi SF (xy) mx; which approaches zero as xX, 7a. 
(DY epg y-xy lly yy l- 
=19x. ,/20+1/20x; 9, show that if x, > 1 then Xiang < % and 


*i+d 1 
Xs 44 > 1. (b) X5=-75, “x401.125, Xg=1.0000039. 
2 
(a) Compute a- Xiggr A-Xj 443 then from E,gou(1-t/r)e; {+0 (€;) get 


Cs go/Es ay z 1-1/r. (b) Five iterations are enough to determine 
r=3; then using (8.6-13) with x,=0 get X4=-51775738. 




32. 


33. 


34. 
35. 


37. 


38. 


39. 


40O. 


41, 


43. 
Qu, 


4S. 


46. 


. N 
ax) sax ag. .= J ag, /ay'*)-ae, sax"), 
Jk p2, 73 


(a) Use Pit) jx) py py 2) pay) -£ Ex CW) (2), i=1,2. 
(1) (1) (2) _) (2) 


i471 ~%y 0 and X54 -x; 


(b) Get two simultaneous equations for x. 
(c) 1/2a° g(E-y,); components have form of products of two 
components of Y3i therefore, components of €. i+] have form of sum 


of products of components of Es. (d) Definition: Order p if lim 
le, ,,!Zle, [Pac#o; use result of part c to get order of 2 for 


(8.8-12). 
e* cos y=X; e* sin y=y; xi) =.3181316, 5) =1-3372354. 
(a) Does not converge. (b) Converges to Xy=-35725, Yy=2-78575. 


(a) (x,A Ax) = (Ax, Ax) 


-1 1 


(b)  (D(A))~'=diag ((d,+)7',..., (a 4a) 71) 


(c) Let AT£ag. (H(A) E,H(A)E) = J (a,4a) 7292 
ig fe =) <b i Si 
dg  % ox Ig OF ; 
ae = ox ae * st ~ ox = x! (t)+e_ te £(x,)=J (x) x (t)+£ (x) e 


_ _ _ (1) 
(a) z=axt+tby+c where a=D ,/D ,b=D,. /D,c=D °/P with D an lfs_ KYG- geile 


¢ fl) 


(a) O= 


D second 


oT a Vuk Fpag ls Deby epee: 
plane similar replacing superscript 1 by 2 to get z=dx+eytf. 
(b) X;,4= (cd-af) /(ae-bd), Ys 44> (b£-ce) / (ae-bd) ; fails if points 


are collinear. (c) After 10 iterations x=.317692, y=1.338849; 
there is no always convergent analog of false position in more 
than one dimension. 


(a) Wott, =I, Wf tm, £5 1 =0; solve to get secant. 


=(G ~a)+1/2 (x5 -a) * 20 (x -a=In 


-a) and Xiad : 


(b) fy (xs. 4) ki * 541 tt 


=- = - . t 
(X55 a). (c) £, (xj5) (Gy Xs -4 ~a)+ (XK, 370) TOy (5. -5 ~a); can't 
infer quadratic convergence because nehavior of Ts "s is unknown. 
(d) After five iterations x=.3181320, y=1.3372355; rule is 
reasonable because in a sense it removes worst approximation. 
(a) Draw a picture. (b) After four iterations x=.3181315, 


y=1.3372356. 
Use partial fractions. 
(a) f_m(x) is greatest common divisor of f and f'; therefore, find
multiplicity of zero in f_m(x) and add 1. (b) If f(a) = 0, f(a+ε)
has same sign as f'(a); if f(b) = 0, f(b-ε) has opposite sign of
f'(b); if zeros are multiple then each f_i(x) = 0.

(a) Yes; Cauchy index, V(a) and V(b) unchanged. (b) Sturm if
f_1(x) does not vanish; if f_1(x) does vanish generalized Sturm
sequence with P(x) = f_1(x); for rational function let numerator and
denominator be f(x) and g(x), respectively.
(a) 1 positive, 1 negative; (b) 1. 
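
Counts like these come from the number of sign variations V(a) - V(b) of a Sturm sequence. The sketch below builds the classical sequence f, f', and negated remainders with exact rational arithmetic. It assumes a polynomial of degree at least 1 with a nonzero leading coefficient and end points that are not zeros of f; the function names and the test polynomial x^3 - 3x + 1 are illustrative and not taken from the problem.

```python
from fractions import Fraction


def poly_rem(a, b):
    """Remainder of a / b; coefficient lists, highest power first."""
    a = [Fraction(c) for c in a]
    b = [Fraction(c) for c in b]
    while len(a) >= len(b):
        factor = a[0] / b[0]
        for i in range(len(b)):
            a[i] -= factor * b[i]
        a.pop(0)                      # leading coefficient is now zero
    while a and a[0] == 0:            # strip any remaining leading zeros
        a.pop(0)
    return a


def derivative(p):
    n = len(p) - 1
    return [c * (n - i) for i, c in enumerate(p[:-1])]


def sturm_sequence(p):
    seq = [[Fraction(c) for c in p], [Fraction(c) for c in derivative(p)]]
    while True:
        r = poly_rem(seq[-2], seq[-1])
        if not r:
            break
        seq.append([-c for c in r])   # negated remainder
    return seq


def evaluate(p, x):
    result = Fraction(0)
    for c in p:                       # Horner's rule
        result = result * x + c
    return result


def sign_changes(seq, x):
    values = [v for v in (evaluate(p, x) for p in seq) if v != 0]
    return sum(1 for u, v in zip(values, values[1:]) if (u > 0) != (v > 0))


def count_real_zeros(p, a, b):
    """Number of distinct real zeros of p in (a, b), via V(a) - V(b)."""
    seq = sturm_sequence(p)
    return sign_changes(seq, Fraction(a)) - sign_changes(seq, Fraction(b))
```

For instance, `count_real_zeros([1, 0, -3, 1], 0, 2)` counts the positive zeros of x^3 - 3x + 1 below 2.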
47. 


48, 


49, 


50. 


51. 
52. 


53. 
54. 


55. 


56. 


57. 


59. 


60. 


(b) If overwriting of the coefficients of P(z) is permitted, p 


additional locations for Xpeee XP are required while for iterated 
synthetic division, one more location (for x) is needed. 


=(z) j-1 n-j j-1 
(a) f£(z)=(z FAS _ Zz +...tAg) (b,_ 52 +..-+Do) +Bs_ z +...+ 


1 1 


Boi By As Pg HAs Peg am es AQP g Ken- dr. +2 10: 

Drage TP paege gts + =P yTO BMA, WA,by-Ay_ pDy-+ + “AQD, 

(b) Zeros are +1,+1,+1,-1,-1. 

Accepted Zero Deflated Polynomial P. (x) Zeros of P. (x) 
-.981230(4)  x7+.880000x-.637440 (2) .755610(1) ,-.843610(1) 
~.873413 x7+.981231 (4) x+.883867 ~.900774 (-4) ,-.981231 (4) 
~.912156(-4) x7+.981318(4)x+.857018 ~.873412 ~.981231(4) 

Accepted Zero Deflated Polynomial Po (x) 
~.981230 (4) x°+.873504x+.7966901 (-4) 
~.873413 1.01464x~+.981229 (4) x+.895036 
~.912156 (-4) 190214 (7) x7+.963967 (4) x+. 857020 (4) 


Zeros of P. (x) 
-.912158 (-4) ,-.873412 
-.912158(-4) ,-.967070 (4) 

Complex 


(a) g(z)=2"£ (1/z). 
. i _ t . _ t, 
Let Zo 4X5 ythys_y and b, =a, 4, a ba with b)=y,+i6,; then 


= - — ° = ' - 
Vy 175 M1 WY 5 Od 1 1 OKA G ce FY jad Me 8%5-1 41 775-19 


t 

k+1! 
' ' t _ _+! 

8X51 Oe 1 t¥ 51% Kad” R3=Pige Ry-y=P4- 

Doesn't converge; expect sensitivity to initial approximation for 


reasons given in Sec. 8.8. 
True value: .2878155+1.4160932i (see Prob. 64e). 


r r r r 
(r)_ _4\n-j 2 2)_)_4,n-4, 2 2 
(b) ay ( 1) Sn-j (Ayres an) =( 1) Eo; Ke MOS e 
1 n-j r 
(c) Magnitude of dominant term in sir) is (D4++-05) when 


(a) Long division. (c) Equate coefficients of like powers of z 
in part b. (d) Since u=s_. 

m ~m 
(a) Add terms of the form c, kat, j=0,..-,m; where m; is the 


multiplicity of a5. (b) In (8.10-43) factor out all terms in a,- 


(a) Using U4y to Yiy? B,=2.23607, $,=.46365. (b) Using Ug7 to 
W400" B,=1.54980, $,=--58205. 


(a) For (8.10-51), see Prob. 56b; for (8.10-51) consider 
derivative of f{x)/f(x); (8.10-59)-(8.10-64) verifiable by 
algebraic manipulations. 




n 
(b) Use induction to prove that n } a? - (} a,)" is 

i=1 i=1 
nonnegative for any set of real a, ‘Ss. (d) By construction $ (x) 


(1), 


' 
f (x; ) we assure that X54? Xy and that Ixi44 


has one zero in (x,x 1) ) if x<x by choosing sign to agree with 


-x;| has minimum 
magnitude; argument for x) cx is similar. (e) (i) X6=1.5000000, 


true value: 1.5; (ii) Xe=—1.2878155+.8578968i which is true value 


to seven decimals; other pairs of zeros given in Prob. 58. 


(A) 


P oc 
62. (b) nidtt) 2 y “1 _ . yse (8.11-3) and (8.11-4). 
Oo j=1 A5-S) 


63. (a) dz = 5x - i dy e 


64. (a) 2.0945515, -1.0472757 ± 1.1359399i; (b) 15.988265,
-4.273434, .439079; (c) 1.4476230, -2.0523004, .3023387 ±
.4951598i; (d) .8714384, -.2492279, -.3111053 ± 2.1230983i;
(e) .0446927 ± .3633450i, -1.2393991 ± .6270835i, 1.1947065
± 1.5621069i.

65. (a) 20.104687; (b) 7.6132307; (c) 2.1. 
66. (b) Because |e| as given by (8.13-5) will not be small. 


(c) 2 =1, Jel<to-7"; 8 


; 22-5, lel x 6x10" 3 
=-15, Je] ~ 200. 


: Zo=-8, le| = 7x10 ~; 


29 
CHAPTER 9 


1. Any time a_ii^(i-1) = 0 interchange rows and columns until nonzero 
element in (i,i) position; iteration terminates when this cannot 
be done or after n-1 stages; at this point status of right-hand 
sides and number of nonzero coefficients in last nonzero equation 
give results of theorem and corollary. 

2. (a) Sparse if orthogonal polynomials are used; filled otherwise. 
(b) Number of equations equals number of points in mesh, but 
equation for each point involves only a few other points of mesh. 


3. (a) Use induction. (b) x_1 = x_2 = 1. (c) Determinant has constant
magnitude 1 but elements of matrix grow. (d) ε = .018,
x_1 = -159.2, x_2 = 100; ε = .02, x_1 = 18.8, x_2 = -10; ε = 1/55,
no solution.
4. For (9.3-13) use Σ_{k=1}^{n-1} k = (n-1)n/2; Σ_{k=1}^{n-1} k^2 = n(n-1)(2n-1)/6;
for (9.3-14) M = n-1 + Σ_{k=1}^{n-1} (n-1)(n-k+1).
6. L Le = Ie. 
7. (a) re eer se suppose i < j; if A,B lower triangular show 
.. = 0. 
1) 
(b) kj = Bag! k <i, = —™ A 55 + dias k > i 
8. (a) Let B = Pk yx, vi B and B = L. Pk ir B. Then De s=D =P... t <1 
t3 = -m 5 D5 + Des = Bese t > 0, t # k, ry 
b,. =-m.b.. +b..=b,. 
kj r,i 13 r.3 kj 
bo. =-m + b. = b 
ry ki “lj kj J 
9. (a) Induction on k (fix m, 1_4)- 
_ .(r-1) (r-1) . 
(b) Recall that mM. = ayy [ay , il>r. 
n°--n . . . n°-n ,..._. 
11. (b) z multiplications, > divisions. 
(c) Same. 
12. (a) qd; = Lege ass = £5 5/4;, re = & 5/4; - 
_ ~_= leet =z _ oT _ 
(Dy Syed = ned Enea Bnet! Gn = Ann 7 Sn-1 Pn-a En-1 
where An is given by (9.3-41). 
a. a. 
13. (b) af?) = ass - —t ay = a. - it any = all. 
3 J ll J J 11 J 
(da) X4=4.5856, X5=-.6312, X4=2.7345; true solution: 
X4=-458669, Xo=--63152, X3=2.73520. 
15. = (a) X4=-.1927, Xo=-- 1.6729, X4=2.2328, X,=- 4470; true solution: 
X4=-.18192, X5=- 1.66303, X,=2.21723, x, =- 44670. 
(b) X,=-. 1817, X5=- 1.6630, X,=2.2170, X,=- 4468. 
_ —am! . = =1- = i>j7°: = 
16. (a) Let B=(b;.]=A, A,i show bis 1, ifn, ban 1; bi, 0, i>j; Dis 0, 




17. 


18. 


19. 


20. 


21. 
23. 


| contains no large 


i<j#n; b, =0. (c) A, is normalized and A, 
elements. (da) Use By 47A3,t1/2E3, 34 where E31,31 has 1 in (31,31) 


position and zero elsewhere; then calculate B3 4334: differences in 


B3i and AS} are as large as 273, (e) At each stage only elements 


of last column are changed. (f) Show each stage of elimination 

doubles (31,31) element. (g) At second stage largest element is 
in last column. 

(b) If there exist Xoreee eX, such that right-hand side of part a 


is negative, then choose X such that second term on left is zero. 


(c) aft) = a, jas 4/8443 also largest element of positive definite 


matrix lies on diagonal. (dad) Since each matrix is positive 


. _ -6 = = =3 _ 
definite. (e) a4 4=2%10 , a4 a1 1.3x10 pAoo=-9. 


(a) True solution: X4=-73152, X5=-2.18886, X3=--65439. 
(b) True solution: X4=1.95539, X5=.10568, X3=1.45557, 


X y= 94924, 


(b) [Al], =3) HIATT = 4 


t t=—1 1 1 
() Hae) HATH eb eh 


(g) x = (.5000, .4995, 1.500)" 
(h) x = (1.000, .4990, 1.501)" 
(i) x' = (.5000, -499.5, -1500)7 
~ ai-ta a 
(a) y, = (b, - a his Y5)/25 3 i=1,...,n 
n a 
x, = Y; “ seher AS x i=n,...,71 
i-1 
(b) y, = (b, - ay ts Ys)/ks ue i=1,...,n 
n 
x = (yy va bes xs)/hi ae i=n,...,1 
~ ist 
(C) yu=bp- J tis ys i=1,...,n 
j=1 
n 
x; = ¥,/d, - Wig Xs isn,...,1 where D=diag (d,,...,/d,) 
j=1+1 
i-1 
(a) y, =b, - } i3 5 i=1,...,n 
j=1 
n 
x; = y;/4; “sabes Lr Xs i=n,...,1- 
2 
(a) [tl] < TI 


(c) All leading submatrices are diagonally dominant and hence 
nonsingular. 




24, 


25. 


27. 


28. 


29. 


30. 


32. 


33. 


34, 


ElafP is 3 laQP telat 2 imsl < 2 
fo) hy Hag TS ky Hag Mt TAag! oy mag! Sh 
las sl+laysPC-Im;,]) 
; (1) 
= : - . ‘ < 2 - ee ‘ < . 
iby lay, | Im, slay, la;,| Im sllay5! s a5 


where the prime on y denotes that the term with i=j is excluded. 


u.. e 
4 ij7j - 
(b) 4 ____ J=it+t (1+n.), i=n,...,1, In, | < 2 p 


* 


A 


diag (255 e€.) 


(da) i 


+ 
fo 
t 
K 
| 
on 
Oo 
t 
It 


diag (n;) 
6L U + L 6U + 6L 6U 


. te, &.. ur. tn. L.. 
ii ij 3 ij 


® >t» G> ct > 
+ 
oO 
&> 
|x 
N 
K 
oO 
CG 
it 


wy 

ad 
> 

Wd) 
- 


(e) [BII, < [1Bll_ = (o(n-1) + 4 (n?-2ne2) 317? 


where B = E or E. 


(a) X4 = 0.3333 Xo = 1.000 


(b) x, = 1.000 Xo = 2.000 


1 
(c) Residual from part a r, = -00010451 ro = .00005226 


- 7 (tea -2.883, 


(d) A = 10 ( ° 
-2.161 4.3235 
= X5 = X3 = Xy = 1. 


(a) X4 


(a) Reciprocal: x_{i+1} = x_i(2 - a x_i). (b) C_{i+1} = I - AB_{i+1} = I - AB_i(I + C_i)
= C_i(I - AB_i) = C_i^2. (c) For then C_i → 0 and B_i → A^-1. (d) B_i b → A^-1 b;
result is (31/32, 63/32, 31/32); need for matrix products at each
stage is bad computationally.
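
The matrix analog of the reciprocal iteration can be sketched as follows: with C_i = I - A B_i, the update B_{i+1} = B_i (I + C_i) gives C_{i+1} = C_i^2, so the residual squares at each stage. The 2x2 matrix and the starting guess below are hypothetical examples chosen so that the row-sum norm of C_0 is below 1; the helper names are also illustrative.

```python
def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def residual(A, B):
    """C = I - A B."""
    n = len(A)
    AB = matmul(A, B)
    return [[(1.0 if i == j else 0.0) - AB[i][j] for j in range(n)]
            for i in range(n)]


def row_sum_norm(C):
    return max(sum(abs(v) for v in row) for row in C)


def iterate_inverse(A, B, steps):
    """Apply B <- B (I + C), C = I - A B, recording the norm of C each step."""
    n = len(A)
    norms = []
    for _ in range(steps):
        C = residual(A, B)
        norms.append(row_sum_norm(C))
        I_plus_C = [[(1.0 if i == j else 0.0) + C[i][j] for j in range(n)]
                    for i in range(n)]
        B = matmul(B, I_plus_C)
    return B, norms


A = [[2.0, 1.0], [1.0, 2.0]]       # hypothetical test matrix
B0 = [[0.25, 0.0], [0.0, 0.25]]    # hypothetical starting approximation
B, norms = iterate_inverse(A, B0, 8)
print(B, norms[:4])
```

Each step costs two matrix products, which is the computational drawback noted in part (d).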


(a) Solve for Diy in the ith equation. 


(b) Let Bea-\I; [a,.-Al < } fa.,|. 
Li k#i ik 


(c) Since B=-(L+U) diagonal terms of B are zero. 
(b) For second part use € =X, —Xy in definition of Figg fC) Consider 


change from €; to C544 component by component; show each change 


i+1 
which converges to zero; therefore, Q(e,) > 0 for any Ey: 
(a) (I-L) 7" '=a+n4L7+... . (bb) By=-L-U; Bye-(I+L) 7! 
from part a. 
(a) Jacobi: ρ = 1/4; Gauss-Seidel: ρ = 1/16. (b) Result is (19/20,
5/2,19/20); iteration for second component is not converging
monotonically yet. (c) Result is (1,2,1) and is just fortuitous.


reduces Q by ~trid 7a, (da) Q's form a decreasing sequence 


U is nonnegative 




35. 


37. 


38. 


39. 


41. 


42. 


43. 


44, 


45. 


46. 


a7. 


48, 
49. 


50. 


(a) .999991. (b) x '")=.39473, x'2)=.62470. (c) after first 


iteration still have x). 33116, x2) =, 70000: reason is severe 


ill condition of matrix. 
(a) B,=(1+w,)B-w 1; C,=(1+w,)C. (b) o[K, (B)]= max d (K; (B) 1; 


since K ; (B) is sum of powers of B get result. (da) K; (1) =1; 
Ky (xdem (x5) /m (1-75) 5 want K, (x) to be minimum on [Xp eX4)i 


therefore, make change of variable to [-1,1] and use Chebyshev 
polynomials. (e) Since zeros of K, (x) are Y5 ‘Ss. (f) Must 


recompute u's again in order to minimize o(K,,,(B)]. 


(a) x= (15/16, 15/16, 15/16, 15/16) ; (b) xy= (127/128,127/128, 255/256, 


255/256); (c) (1,1,1,1). 
(a) ρ(B) = 1/2; (b) ρ(B) = 1/4. (c) Yes, Gauss-Seidel converged
more rapidly.


-1j } e e 
a L t U = e e ? e -u. e = 1: . e .=0, k= -1, 2,0 e , 1 e 
(a) e [sis] S54454 aa $Ki4G5 j 5 
~1 n k-1 
(b) For L , k-j multiplications for each Typsi yf} (k-5) 
J ok=2 j=1 
1 1 


= 1/6n> + o(n*), similarly for U., L~ uy! requires 1/3n> + O(n). 


(a) xO) x pean ty, (b) 1/3n? for triangular decomposition; 
to solve Ly = e_i note that last n-i components of y are zero; really
have i×i triangular system which requires 1/2 i^2 + O(i) operations;
so all Ly = e_i solutions require 1/6 n^3; Ux = y requires n(1/2 n^2); total
is n^3. (c) Each e_i in part a may be considered a column of I.
-1 , -1 
(b) Let U =(s,5]; start with last column of A; Asn *Sin' 


j=1,...,n; given columns j+1,...,n get column j. 


(c) 1/3 n^3 for LU; 1/6 n^3 for U^-1; 1/2 n^3 for algorithm of part b;
n^3 total.
t -1 n -1 n -4 
DE, Manteg TO EG DS Mam 1 bs Aik *ky = 0, 2? 3. 
1 


(c) Calculate last row of A, last column, next to last row, next 
to last column, etc. 
A = LL^T, A^-1 = (L^-1)^T L^-1; 1/6 n^3 for L, 1/6 n^3 for L^-1, 1/6 n^3 for
(L^-1)^T L^-1, 1/2 n^3 total.


_yawl T -1 en nn! 
(b) Cy yyalC, + Ua Maz) °C) Co=D + then Ci=B . 
(k+1) 


(a) Use induction on k; for k+1 and i and j < k, re =0 from 


definition; for i=k, ast!) = ays - an = 0; similarly for 

a (k+l) | (b) Must show b..=6.. + ult) yO) (c) C,=I; as 
ik ij ij k=4 k k ° 07’ 

before get C_ = B |, 

Inverse by rows: -.16194, 1.00902, -.10784, .51468, -2.03801, 

-2.06231, 1.10330, .10080, .44699. 

Upper triangle of inverse by rows: -2.22291, .20574, -1.03891, 
2.05389, 1.38996, .21704, -.43611, .68163, .73332, -.80208. 

If x ≠ 0 is eigenvector corresponding to eigenvalue λ, then
0 < x^T Ax = λ x^T x in positive definite case.




51. 


52. 


53. 


57. 


58. 


59. 


60. 


61. 
63. 
66. 


68. 


(b) ||Ax||_2 = 0 if and only if Ax = 0. x ≠ 0 implies that
columns of A are linearly dependent.
(a) Use (1.3-22) and (1.3-23) since B^T B is symmetric.

3 
Using the algorithm of Prob. 55, 2h + 5 n* = 3 n multiplications, 
2n divisions and n square roots are required. 


m ~2n 
c= (0,272 ,/0, pera d)i, t = m+2n, t, = m+1, s, 


a3 = (Mg 04000041 ,0 400040 ,-1,0 4464 ,0)> 1 in position m+i, -1 in 
position m+tn+i. 
(a) minimize Y¢ 


=s=n, 


subject to y,+x,; y, + xt Y3 + x7¥4 + xs Y5g ~ Y¥6 £ a 
i=1,...,9 
2 4 5 

“¥y — *i¥2 7 *G¥3 7 %iYy 7 *G¥5 7 Ye S ~Fy 
i=i,...,9 

Y6 2 0 

23 

(b) minimize y Y, 
k=6 


; 2 3 5 
subject to y, + X; Yo + X; Y3 + Xi Yy + *} Yo + Yug5 ~ Ygady 


= f, i=1,...,9 
Y, 2 0, k = 6,...,23 
(c) Assume that x_1 is a local, but not global, minimum of f(x) and
let x_2 be a feasible solution such that f(x_2) < f(x_1). Then
f(αx_2 + (1-α)x_1) < f(x_1) for all α, 0 < α < 1.
(a) Let Xqrece eX be the positive components of x and denote by 


k 


a., the 3° column of A. Then ) x, a. 
+4 ja. 9 73 


vector y = (Yq rere r¥y 204-4470) with at least one Y5 > 0 such that 


= b and there exists a 


k 
) Y; a; = 0. x - ey, € > 0, is a feasible solution for small 
j=1 


values of e€. 
(c) For sufficiently small ε > 0, x + εy is a feasible solution.
x = (0, 4/7, 12/7, 0, 0)^T.

-.45492, -—.24053(-2), -.37563, -.27161, .20928(-7), .10002(-7). 
Let Ax = e where A is represented by the vectors a,b and c as in 
the text. Initially store in d, the values d. given by (9.3=-53). 
For i=1,...,n-1 do if b,/a; > Qs 44/4544 

ist) (Cg ePigg)s (ezee 


+ 0 endif m + as,4/d;. 


) 


then da. + 0 else swap (b,,a 

Gia * dis qd; ~ Cigae Cia 

Digg tO ueaT™M Spe Seng * Sega MIG Cgay 

Follow by back substitution. 

5n-6 additions and multiplications, 4n-3 divisions. 
. . (r-1) 

<i, i<r. las; | 

(r=1) 1 

ij 


i+1 


+ e..,47- . endfor. 
fier ™ &y Spero 


(a) la, 1, i> Xx. 


JA 


<2,ic¢r. fa 1 otherwise 


JA 




w w,7(r-1+)3) 


(c) Jase | <1,r<jw,<2', r> 5-1, <2 


Bound on [aly | < bound on Jags) | for all i and r. 
69. (b) For r = n,n-1,...,1 py = 1 
n 
= -_ g e 1 eee 
rj rj eat UrrK kj r J ’ 70 

} 

u. =(a - u., 2, )/L__,i=r+1,...,n . 

ir ir k=r+1 1k kr rr 




otherwise. 


CHAPTER 10 


1. (a) Show y_j^T A x_i = λ_i y_j^T x_i = λ_j y_j^T x_i.
2. (a) Show aXx, = Mex, for 10.1; for 10.2 consider xFAx, and its 


conjugate transpose. (b) f(A) has eigenvalues f(λ_i); that is, all
zeros; the only symmetric matrix with all zero eigenvalues is the null
matrix. (c) f(A)x_i = 0 where x_i is any eigenvector; thus for any
x, f(A)x = 0 since the eigenvectors span the space.


3. From (8.10-51) given u,,..., u, get a. _yreeey ay successively. 
4, (b) Consider Emo a omg: (c) Show each step does not affect 
zeros from previous steps. (d) Use induction. (e) Postmultiplication
by M_j requires nj multiplications; premultiplication by
M_j^{-1} also requires nj; M_j itself requires n; sum from j=1 to n-1.


(f) Exchange (k-1)st and jth rows and columns. (g) Express 
characteristic equation as product of two determinants, one 
already in form B. 

5. (a) Domain covered by circles with centers at x=2, 1, -1 and
radii 5, 2, 4, respectively. (b) Centers at 2, 1, -1, radii 2, 5,
4. (c) By rows B is 2, -1, 3/2; 2, 1, 1; 2, 3, -1; centers 2, 1,
-1; radii 4, 4, 2 1/2. (d) (n-m)×(n-m) principal minor unchanged;
lower right m×m minor unchanged; rest of lower triangle multiplied
by α, upper triangle multiplied by 1/α. (e) α = √5.
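The radii quoted in part (c) can be checked mechanically. The following is a minimal sketch, not taken from the text: the helper name gershgorin_discs is hypothetical, and the radii are assumed to be the off-diagonal column sums of the matrix B listed by rows above.

```python
def gershgorin_discs(matrix, by_columns=False):
    """Return a list of (center, radius) pairs, one disc per diagonal entry.

    The radius is the sum of absolute off-diagonal entries in the same
    row (default) or in the same column (by_columns=True).
    """
    n = len(matrix)
    discs = []
    for i in range(n):
        if by_columns:
            radius = sum(abs(matrix[k][i]) for k in range(n) if k != i)
        else:
            radius = sum(abs(matrix[i][k]) for k in range(n) if k != i)
        discs.append((matrix[i][i], radius))
    return discs


# B as listed by rows in the hint to part (c).
B = [[2.0, -1.0, 1.5],
     [2.0, 1.0, 1.0],
     [2.0, 3.0, -1.0]]

column_discs = gershgorin_discs(B, by_columns=True)
print(column_discs)  # [(2.0, 4.0), (1.0, 4.0), (-1.0, 2.5)]
```

The centers 2, 1, -1 and column radii 4, 4, 2 1/2 agree with the values stated in the hint.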


6. (a) Compute diagonal elements of p'ap. (b) Compute diagonal 


elements of TT*. 


7. (a) Consider linear transformations represented by matrices.
(c) B = PAP^{-1}; r(B) ≤ r(AP^{-1}) ≤ r(A) but also r(A) ≤ r(B).


8. (a) Consider (A-λ_i I)x = 0; let y = P^{-1}x where P is such that P^{-1}AP
is Jordan canonical matrix J; then (J-λ_i I)y = 0; rank of J-λ_i I is n
less the number of different elementary divisors in which λ_i
appears. (b) Change variable to get (J-λ_i I)^ν y = 0; let νth component
of y be nonzero; satisfies equation but (J-λ_i I)^{ν-1} y ≠ 0.


- Jy =(a- Jia- v-j 
(c) (A d; 1) Xs (A dT) (A A; T) x 


Vv 
(ad) Consider (A-a, 1) Yo! times linear combination of Xs 


9. (a) (10.2-1) not possible since there are not n eigenvectors.
(b) λ = 4 is double eigenvalue with one eigenvector.
(c) After 10 iterations λ ≈ 4.41; convergence is very slow; after
50 iterations λ ≈ 4.08.
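The slow convergence noted in part (c) is typical of the power method on a matrix with a double eigenvalue and only one eigenvector. The sketch below is illustrative only: the 2×2 Jordan block J is a hypothetical stand-in, not the matrix of the problem, and power_method is an assumed helper name. For this J the estimate after k multiplications is 4 + 4/(k+3), so the error decays like 1/k rather than geometrically.

```python
def power_method(matrix, v0, iterations):
    """Return (eigenvalue estimate, scaled vector) after the given
    number of multiplications by `matrix`."""
    v = list(v0)
    estimate = None
    for _ in range(iterations):
        w = [sum(a * x for a, x in zip(row, v)) for row in matrix]
        estimate = w[0] / v[0]          # ratio of first components
        scale = max(abs(c) for c in w)  # rescale to avoid overflow
        v = [c / scale for c in w]
    return estimate, v


# Hypothetical defective matrix: 4 is a double eigenvalue
# with a single eigenvector (1, 0).
J = [[4.0, 1.0],
     [0.0, 4.0]]

after_10, _ = power_method(J, [1.0, 1.0], 10)
after_50, _ = power_method(J, [1.0, 1.0], 50)
print(after_10, after_50)
```

With this stand-in the estimates are roughly 4.31 after 10 iterations and 4.08 after 50, the same qualitative behavior as reported in the hint.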

10. (a) No, v_0 must also have component in direction of other
eigenvectors; no, merely need some nonzero component in direction of
another eigenvector. (b) Keep taking v_0 orthogonal to previously
found eigenvectors; may fail if no component in direction of some
eigenvector. (c) Will get new eigenvalue and eigenvector.
(d) λ = 8; x_1^T = (1,1,1), x_2^T = (1,7,10).

11. (a) Characteristic equation of B given in Prob. 4d; with Un 
defined as in (8.10-47) and with components of x initial values for 
(8.10-47), first component of px, is Untk’ (b) Squaring companion 


matrix squares eigenvalues. 
12. (a) f(λ_i) where λ_i's are eigenvalues of A. (b) If |f(λ_1)| is
dominant, lim [f(A)]^m v_0 / [f(λ_1)]^m = α_1 x_1.




(c) f(λ_1) is large but series for 1/(x-y) converges slowly.
(d) This is equivalent to using first k+1 terms of expansion of
xKtT 7 (x=).

13. (a) Using A^2 + 2A + 4I after 15 iterations get 15.5071 where for true
eigenvalue λ^2 + 2λ + 4 = 15.5070. (b) In Wilkinson's method k=1.

14. (a) e^{λ_1} where λ_1 is the largest eigenvalue algebraically; e^{-λ_n}
where λ_n is the smallest eigenvalue algebraically. (b) If result
is less than 1 then λ_n is positive and matrix is positive definite.

15. (b) 1.4801215; result is fortuitous.

16. (a) Want to maximize ratio of λ_1 - p to eigenvalue of second
greatest magnitude. (b) .732.

17. Because of the 2m in (10.2-16) as opposed to the m in
(10.2-11).

18.       Eigenvalue    Eigenvector
    A_1   9.6055509     (1.0, .6056, -.3944)
    A_2   8.86988       (-.6043, 1.0, .1509)

19. (a) A_1: m=8 gives 9.6055502, whereas tenth iterate is 9.6055530;
δ² process also applicable to A,, By, B3. (b) Converges to
7.6055509 in 11 instead of 13 iterations.

21. (a) usy, = d, + } E541 + J esk
(b) y_j^T x_k = 0, j ≠ k.

23. Method must converge since off-diagonal norm is monotonically
decreasing and bounded below by zero; prove convergence to diagonal
matrix by assuming contrary; obtain contradiction by showing that
sequence of off-diagonal norms cannot converge to nonzero quantity
if largest off-diagonal element chosen at each stage; sequence of
norms could conceivably converge to nonzero limit for some choice of
sequence of nonzero elements.

24. (a) Average of square of off-diagonal elements is ν_0^2/[n(n-1)] > ν_1^2.
(b) Each annihilation reduces square of off-diagonal norm by at
least 2ν_1^2. (c) Use induction and result of part b. (d) Square of final
off-diagonal norm ≤ n(n-1)ν_1^2 < ν_0^2.

25. ν_1 ≈ .54; one annihilation at first stage after which diagonal
elements are 2.0, 0.0, 2.0; ν_2 ≈ .18; one annihilation; diagonal
elements 2.530, 0.0, 1.470; ν_3 ≈ .06; two annihilations; diagonal
elements 2.53649, -.0166472, 1.48016.

26. (a) Use (10.3-12) to show t_13 and t_1i remain zero during (3,i,2)
transformation. (b) In general show tyie j=1,...,i-1 remain zero.

27. (a) Use induction. (c) £ (4) is a constant; if f,-,) = 0 then
fiigeg 0) = eff, gg (A). (A) £9 0A) = (-bg) (by) = 09: 2 y=byi
£ a2 (A) is negative at r, and positive at + ∞.

28. (a) If x eigenvector of A, y of B then x = S_1···S_r y.
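The convergence argument for the Jacobi method given above (monotonically decreasing off-diagonal norm when the largest off-diagonal element is annihilated at each stage) can be observed numerically. This is a minimal sketch under stated assumptions: the 3×3 symmetric matrix A is hypothetical and not one of the book's examples, and the function names are illustrative. Each rotation reduces the square of the off-diagonal norm by twice the square of the annihilated element.

```python
import math


def off_norm_sq(a):
    """Sum of squares of all off-diagonal elements."""
    n = len(a)
    return sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)


def rotate(a, p, q):
    """Return J^T A J, where J is the plane rotation in the (p, q)
    plane chosen so that element (p, q) of the result is zero."""
    n = len(a)
    theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
    c, s = math.cos(theta), math.sin(theta)
    b = [row[:] for row in a]
    for k in range(n):              # columns p and q of A J
        akp, akq = a[k][p], a[k][q]
        b[k][p] = c * akp - s * akq
        b[k][q] = s * akp + c * akq
    r = [row[:] for row in b]
    for k in range(n):              # rows p and q of J^T (A J)
        bpk, bqk = b[p][k], b[q][k]
        r[p][k] = c * bpk - s * bqk
        r[q][k] = s * bpk + c * bqk
    return r


def jacobi_largest_element(a, tol=1e-20, max_rotations=100):
    """Annihilate the largest off-diagonal element at each stage and
    record the squared off-diagonal norm after every rotation."""
    a = [[float(x) for x in row] for row in a]
    n = len(a)
    history = [off_norm_sq(a)]
    while history[-1] > tol and len(history) <= max_rotations:
        p, q = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                   key=lambda ij: abs(a[ij[0]][ij[1]]))
        a = rotate(a, p, q)
        history.append(off_norm_sq(a))
    return a, history


# Hypothetical symmetric test matrix (trace 6).
A = [[2.0, 1.0, 0.5],
     [1.0, 3.0, 0.25],
     [0.5, 0.25, 1.0]]

D, history = jacobi_largest_element(A)
print(len(history) - 1, "rotations; diagonal:", [D[i][i] for i in range(3)])
```

The list history is strictly decreasing, and the trace of the matrix is preserved because every step is an orthogonal similarity transformation.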
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29. 


30. 


31. 


32. 


33. 


34, 


35. 


36. 


(by net x!Pat; x Pacrp,y/eys xO 2(-1/e,) fe, 4x7") + 


(b,-A)x"2)]; i=2,...,n-1.  (c) From Theorem 10.14 it follows that, 
if λ_i is multiple, then some c_i = 0; get eigenvector from each of
smaller tridiagonal matrices and set other components equal to zero.


(b) Show first that Pra, 1 has same first k-1 rows as Ay 4° 
2 (3) ,2 : 
(c) J tvyl* = (1/(2¥S))] {YS tq.) + (1/4 ta, .)) 2 
j=k , , j=k+1 


(4,5) = [1/(2vS)]} (1/ (VS + Fy-1,K)) (28 + 29,4 YS) i compute 


, 5 jak+1 =0, s< k; -2v{%)y 9) 
rer wet Pye 1,r3r sPs3' 5 peee oN? use Ps; , Ss ; Vy. YR 
S?k#j; 1n2v, Dy 9), 

(b) Householder's method: n+0(n) 3 Givens's: 2n°+0(n) . 
(c) PyA, _{P, = (I-2v, vi) (A,_y-2wp vy) « 


(a) For each (j,i,j-1) the ith and jth rows and columns are
affected; for (j,i,j-1) need 4(n-j)+15 operations (using symmetry);
sum from i=j+1 to n, j=2 to n-1 to get result; one square root needed
for each (j,i,j-1).

(b) M=(n-k+2) (n-k+1) multiplications for Whi for Why and Vay + 

n^2/2 each plus O(n) using symmetry; sum from k=2 to n-1.

(k) ,2 
} 


(c) Inv vi every element contains factor of (vy :; therefore, 


kk 
need only [vk }? ; still true if scheme of example 10.7 used. 


vi?) = 22975, "O) 2.97326; w or 25688, -.01357, -1.88908); 


q5= (-. 25688, -.43526, -.10271); Ay by rows 1.0, 1.11804, -.00002; 


1.11804, 1.40000, -.55004; -.00002, -.55004, 1.60024. 
(a) Diagonal terms after five rotations: A_1: 9.6055522,
1.9999997, 2.3944482; A_2: -3.5994607, 8.8698995, 4.7295617;
results for A_1 better because diagonal of A_1 is more dominant than
A (b) A,:b =7, b,=60/13, b,=31/13, c,=713, Co=1/13; 


2° 1 
_\3. 2 nc. . - = 
fy (A)=A THA °+47A-46; Ao: b,=142/25, c,=5, 


C5=81/25; f(A) =A7-1047-704151, (c) Same as part b except for 


roundoff. 


(a) Express A, in terms of vy. and Ay _4 


(b) 5n /3 multiplications and additions; n-2 square roots. 


(a) b4=2.9, b 8, b3=.2; c,=.3125, Co=- 36; fo Os X 3243, 6875) 


27! 3. !2 3° 
+.0625. (b) If Xs=¥_,, n'+0(n"); if xFAV4" 2n +0 (n? ). 


(c)  x5=(11/2,-5/6,1/2) , y5=(3/5,1/5,-4/15) ; x5Y5=0 so method fails. 


1 
b,=3, b,=33/25, 


and compare with (10.3-45). 


(4) b,=2.00415, b, =1.98629; c,=.23235, cy=.76750: 
f(A) =97-6A7 4111-6. 


=2.00956, b, 


(a) A4=3, vj=(3/2,1,3/8) 7 Ag=-2, Vg=(-6,1,3/7); Agate V4=(3,1,3/2). 


1 2 
(b) yf=(1,1,1), yg=(4 5/7,3/7,-6), yt=(-1,1,1). 
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37. 


38. 


39. 


40. 
41. 


42. 


43. 


45. 


46. 


47. 


T = ut = vt = yt 
(by ¥54q% = ¥yPy (APs (A)x, = yyP5_ (APS (AK, = Y5Xi44- 
i i 
T —~ ate . T _ . ; 
(c) ¥i41%i41 7 (A Yj ny C4 4Y5) (Ax, ny C4 5%5) take partial 


with respect to Ci5- (ad) If no v3 or Xs equals zero the result 


is immediate. (e) Yiet%iad = 


Let sv4X Xi take partial with respect to Cis and use part a. (f) Use 
equations in part a. 


(a) Matrix to interchange columns p and q, C=([c, ig? c,,=1, ifp,.q, 


= =1, ..= i : i to 
Cog Cap Ci5 0 otherwise; matrix to add m times column p 


column q, M=[m..]J; m,j=1, Mogi™: m,5=0 otherwise. (b) C =C, 


except Nog? (c) For row k there are n-k-1 


T 
Y; Th AX, -Ee 5Y5 sAX, “te; ij yzAx 3* 


..d3 =m. . 
row and column additions each requiring 2n-2k-1 multiplications;
sum (n-k-1)(2n-2k-1) from k=1 to n-2; for Householder's method
asymmetry doubles the number of operations.
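The sum named in this hint can be evaluated in closed form. The following worked step is a sketch added for convenience, not part of the original hint; the substitution m = n-k-1 is introduced here only for the derivation.

```latex
\sum_{k=1}^{n-2} (n-k-1)(2n-2k-1)
  = \sum_{m=1}^{n-2} m(2m+1)
  = \frac{(n-2)(n-1)(2n-3)}{3} + \frac{(n-2)(n-1)}{2}
  = \frac{(n-1)(n-2)(4n-3)}{6}
  \approx \frac{2n^{3}}{3}.
```

For n = 3 the only term is k = 1, giving 1·3 = 3, which agrees with (2)(1)(9)/6 = 3.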

(b) If vi is an eigenvector of Bo, then (0" Vv a) is an eigenvector 


of B; if we is an eigenvector of B, then (we 1Z Ty is an eigenvector 


1 
of B where z is a solution of (Bo -AI) z=Bow where A is the eigen- 
value of B, corresponding to w. 

Differentiate (10.4-16). 


(a) No, all that is assured is existence of orthogonal matrix Q. 


(k) 7p, (k-1)_, (k-1) , (k-1)_., 2 
Fe cee eee ee Pe acinaen 1/2 
top ; tan 6 = (top -tog t{{th) -tag } ~4tog top } ) 


joege (c) Diagonal elements: 3.0000028, .9999707, -1.9999739. 


(a) A = U*DU, DD* = D*D.
(b) TT* = T*T. Equate the diagonal elements of these matrices.


(a) In A = LU, u_ij = 0, j-i > m, l_ij = 0, i-j > m; therefore, UL has
all elements with |i-j| > m equal to zero. (b) If A = QU has
elements with i-j > m equal to zero, then elements of Q with
i-j > m are zero; thus elements of UQ with i-j > m are zero; since
A_{k+1} = Q_k^T A_k Q_k, each A_k is symmetric. (c) Each Q is an upper
Hessenberg matrix.
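Part (a) can be illustrated with one step of the LR iteration on a band matrix with m = 1. This is a minimal sketch under stated assumptions: the 4×4 tridiagonal matrix A is hypothetical, the factorization is done without pivoting (safe here because A is diagonally dominant), and the helper names are illustrative.

```python
def lu_no_pivot(a):
    """Doolittle factorization A = L U without pivoting.
    L is unit lower triangular, U is upper triangular."""
    n = len(a)
    lower = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    upper = [[float(x) for x in row] for row in a]
    for k in range(n - 1):
        for i in range(k + 1, n):
            m = upper[i][k] / upper[k][k]
            lower[i][k] = m
            for j in range(k, n):
                upper[i][j] -= m * upper[k][j]
    return lower, upper


def matmul(x, y):
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def bandwidth(a, tol=1e-12):
    """Largest |i - j| over elements whose magnitude exceeds tol."""
    n = len(a)
    return max((abs(i - j) for i in range(n) for j in range(n)
                if abs(a[i][j]) > tol), default=0)


# Hypothetical tridiagonal matrix (band width m = 1).
A = [[4, 1, 0, 0],
     [1, 4, 1, 0],
     [0, 1, 4, 1],
     [0, 0, 1, 4]]

L, U = lu_no_pivot(A)
A_next = matmul(U, L)   # one LR step: A_next = U L = L^{-1} A L
print(bandwidth(A), bandwidth(A_next))
```

Both band widths come out as 1, and the trace of A_next equals that of A since UL is similar to LU.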

-1 _ ot 
Aya AF, (ARUP, TIF, + PY T= FLO AQ FY 


Ey My = By (Ao Py TA = By Ay Ayn Pe Eqeg Baad = 
(A TE, 4 Hyg: 
i-1 
a Premultiplyin nm si 
(a) Pay 7 sey 3,3+1,5 


1 Pk 


(A, - Py I) by si itt,i requires 


i-1 


about 4(n-i) multiplications, and postmultiplying R, sat S541, 3 
by Si i+1,4 requires about 4i multiplications. 
Let Ry have ry on diagonal, qy on first superdiagonal, t, on second; 


r=cos 8.p5+sin O5b.; q,=cos e. cos 6,_,b;+sin Ors? 
tj=sin re let Arey have ae!) on diagonal byt) on
subdiagonal; aj**) =cos 85H, cos O,rj+sin oe ptt) = O5rs iy

48. A_1: After 10 iterations eigenvalues are: 9.60555, 2.19353,
2.20091; after 25 iterations: 9.60555, 2.39261, 2.00184.
A_2: After 10 iterations: 8.86988, 4.46548, -3.33536; after
25 iterations: 8.86990, 4.72949, -3.59939.

49. A_1: After 25 iterations: 9.60555, 2.38835, 2.00610;
A_2: After 25 iterations: 8.86990, 4.72164, -3.59154.

50. Particular case of (10.5-20) and (10.5-21).

51. (a) (P_j)_kl = δ_kl if k < j or l < j.

52. (a) Cf. Prob. 21b.
(b) Apply the Cauchy-Schwarz inequality to y and Ax.

53. (a) Let x = Σ a_i v_i; v_i are orthonormal set of eigenvectors; get
ε^2 = Σ |λ_i - λ|^2 |a_i|^2; assume result false and get contradiction.
(b) r^T = (0, ε) but true eigenvectors are (1,1), (1,-1).
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